EMPIRICAL STUDY AND THE CLASSICAL LINEAR REGRESSION MODEL

STEPS IN AN EMPIRICAL STUDY

An empirical study has 6 steps. 1) Objective. 2) Data. 3) Get to know the data. 4) Statistical model. 5) Estimation, hypothesis testing, and goodness-of-fit. 6) Conclusions.

OBJECTIVE

The first step in an empirical study is to clearly state the objective of the study. An empirical study uses a sample of data to draw conclusions about an economic relationship in a population. The objective is either to explain the relationship or to use the relationship to make predictions. If the objective is to explain the relationship between a dependent variable, Y, and one or more explanatory variables, X1, X2, …, Xk, then the following questions are typically addressed. 1) Do X1, X2, …, Xk have independent causal effects on Y? 2) If so, what is the direction of each effect? 3) What is the size of each effect? 4) Which variables have the biggest and smallest effects?

DATA

The next step is to obtain a sample of data. A sample of data is a subset of units from a population. The data can be cross-section data, time-series data, or panel data. Cross-section data is data on two or more units for the same time period. Time-series data is data on the same unit for two or more time periods. Panel data is data on two or more units for two or more time periods, where the units are the same in each time period.
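As a rough illustration, the three data structures can be pictured as small data frames. This is only a sketch; the units, years, and values below are made up.

```python
# A minimal sketch of cross-section, time-series, and panel data structures.
# All units, years, and values are hypothetical.
import pandas as pd

# Cross-section data: several units, one time period.
cross_section = pd.DataFrame({
    "unit": ["A", "B", "C"],
    "year": [2020, 2020, 2020],
    "y":    [10.2, 8.7, 12.1],
})

# Time-series data: one unit, several time periods.
time_series = pd.DataFrame({
    "unit": ["A", "A", "A"],
    "year": [2018, 2019, 2020],
    "y":    [9.5, 9.9, 10.2],
})

# Panel data: the same units observed in each time period.
panel = pd.DataFrame({
    "unit": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "y":    [9.9, 10.2, 8.1, 8.7],
})

print(cross_section, time_series, panel, sep="\n\n")
```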

GET TO KNOW THE DATA

The next step in an empirical study is to get to know the data. To get to know the data, you do two things. 1) Get to know your variables. 2) Organize, summarize, and describe the data.

Get to Know Your Variables

For each variable, determine whether it is continuous or discrete, and whether it is quantitative or qualitative. Qualitative variables are quantified by one or more dummy variables. Also determine how each variable is defined and measured.

Organize, Summarize and Describe the Data

To organize, summarize, and describe the data you construct histograms of frequency distributions and calculate descriptive statistics.
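For example, a short sketch of this step in Python, assuming the data are in a pandas data frame (the column names and values below are hypothetical):

```python
# A minimal sketch of getting to know the data: descriptive statistics and a
# histogram of a frequency distribution. The data frame and columns are made up.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "wage":      [12.5, 9.8, 22.4, 15.0, 31.2, 18.7, 11.3, 25.9],
    "education": [12, 10, 16, 12, 18, 14, 11, 16],
})

# Descriptive statistics: mean, standard deviation, quartiles, min, max.
print(df.describe())

# Histogram of the frequency distribution of the dependent variable.
df["wage"].hist(bins=4)
plt.xlabel("wage")
plt.ylabel("frequency")
plt.show()
```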

STATISTICAL MODEL

The next step in an empirical study is to specify a statistical model. A statistical model describes the process that generated the sample data. The first statistical model that we will develop in this class is the classical linear regression model.

Specification of the Model

To specify a regression model, you must specify an equation that describes the statistical relationship between the dependent and explanatory variables. This can be written in general functional form as Yt = ƒ(Xt1, Xt2, …, Xtk) + μt

Xt1 = 1 for each unit, so it plays the role of a constant. The economic relationship has a dependent variable (Y) and k – 1 explanatory variables (Xt2…Xtk). All factors other than Xt2…Xtk that affect Y are included in the error term, μ. The error term is always unknown and unobservable. To specify a statistical model you do 3 things.

1. Choose the variables. 2. Make an assumption about the functional form of the relationship. 3. Make assumptions about the error term.

This choice of variables and these assumptions define the specification of the regression model.

Variables

There are two types of explanatory variables. 1) Variables of interest. 2) Control variables. A variable of interest is one about which you want to draw conclusions. A control variable is a potential confounding variable. A confounding variable is any variable that has an effect on the dependent variable and is correlated with a variable of interest. It is necessary to include any relevant control variables to obtain unbiased estimates of the causal effects of the variables of interest.

Functional Form

The first assumption underlying the classical linear regression model is the following.

1. The functional form is linear in parameters: Yt = β1Xt1 + β2Xt2 + … + βkXtk + μt for t = 1, …, T

The most frequently used linear-in-parameters functional forms are the linear-in-variables, double-log, log-linear, and quadratic. The marginal effect of Xi on Y and the elasticity of Y with respect to Xi are given below. The semi-elasticity is also given for the log-linear functional form.

Functional Form        Marginal Effect     Elasticity              Semi-elasticity
Linear in variables    βi                  βi(Xi/Y)
Double-log             βi(Y/Xi)            βi
Log-linear             βiY                 βiXi                    βi
Quadratic              β6 + 2β7X6          (β6 + 2β7X6)(X6/Y)

All variables are evaluated at their sample means. The quadratic row applies when the relationship between the dependent variable and an explanatory variable is quadratic.
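As a numerical sketch, the marginal effect and elasticity for the linear-in-variables form can be evaluated at the sample means as follows. The coefficient estimate and sample means are made-up numbers used only for illustration.

```python
# A minimal sketch: marginal effect and elasticity for the linear-in-variables
# functional form, evaluated at the sample means. All numbers are hypothetical.
beta_i = 1.8      # estimated coefficient on X_i
x_bar  = 13.2     # sample mean of X_i
y_bar  = 18.4     # sample mean of Y

marginal_effect = beta_i                     # dY/dX_i
elasticity      = beta_i * x_bar / y_bar     # (dY/dX_i)(X_i/Y)

print(f"marginal effect: {marginal_effect:.3f}")
print(f"elasticity:      {elasticity:.3f}")
```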

Error Term

The error term is a random variable that represents the “net effect” of all factors other than the explanatory variables that affect the dependent variable for the tth unit. We describe the behavior of the random variable μ by a conditional probability distribution: ƒ(μt | Xt1, Xt2, …, Xtk). For each set of values of the X’s there is a probability distribution for μ. The classical linear regression model makes the following assumptions about the error term.

2. The error term has mean zero: E(μt) = 0 for t = 1, 2, …, T

3. The error term has constant variance: Var(μt) = E(μt²) = σ² for t = 1, 2, …, T

4. The errors are uncorrelated: Cov(μt, μs) = E(μt μs) = 0 for all t ≠ s

5. The error term has a normal distribution: μt ~ N(0, σ²) for t = 1, 2, …, T

6. The error term is uncorrelated with each explanatory variable: Cov(μt, Xti) = E(μt Xti) = 0 for all i.

Additional Assumption

In addition, the classical linear regression model is often based on one additional assumption.

7. The explanatory variables are nonrandom variables.
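A simple way to see what these assumptions mean is to simulate data from a process that satisfies them. The parameter values and sample size below are made up; this is only a sketch of the data-generating process the assumptions describe.

```python
# A minimal sketch: simulate data satisfying assumptions 1-7 of the classical
# linear regression model. Parameter values and sample size are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
T = 200
beta = np.array([2.0, 0.5, -1.2])            # beta_1 (intercept), beta_2, beta_3
X = np.column_stack([np.ones(T),             # X_t1 = 1 (constant term)
                     rng.uniform(0, 10, T),  # X_t2, treated as fixed (assumption 7)
                     rng.uniform(0, 5, T)])  # X_t3
sigma = 1.5
mu = rng.normal(0.0, sigma, T)               # mean zero, constant variance, normal,
                                             # uncorrelated across t, independent of X
Y = X @ beta + mu                            # linear in parameters (assumption 1)
```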

ESTIMATION, HYPOTHESIS TESTING, GOODNESS OF FIT

The next step in an empirical study involves estimating the parameters, testing any meaningful hypotheses, and calculating a measure of goodness-of-fit.

Estimation

For the classical linear regression model, there are K + 1 parameters to estimate: the K regression coefficients β1, β2, …, βK, and the error variance (the conditional variance of Y), σ². You need to choose an estimator for the regression coefficients and an estimator for the error variance.

Choosing an Estimator for 1 , 2 … k

To obtain estimates of the parameters of the model, you need to choose an estimator. To choose an estimator, you choose an estimation procedure. You then apply the estimation procedure to the statistical model. This yields an estimator. In econometrics, the estimation procedures used most often are:

1. Least squares estimation procedure 2. Maximum likelihood estimation procedure

Least Squares Estimation Procedure

When you apply the least squares estimation procedure to the classical linear regression model you get the ordinary least squares (OLS) estimator. The least squares estimation procedure says to choose as estimates of the unknown parameters those values that minimize the residual sum of squares function for the sample of data. For the classical linear regression model, the residual sum of squares function is

RSS(β̂1, β̂2, …, β̂k) = Σt (Yt – β̂1 – β̂2Xt2 – … – β̂kXtk)²

The first-order necessary conditions for a minimum are the k partial derivatives of this function set equal to zero. These are called the normal equations. Solving this system of equations yields the OLS estimator.
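In matrix notation the normal equations are (X′X)β̂ = X′Y. A minimal sketch of solving them with numpy, reusing the simulated X and Y from the error-term sketch above:

```python
# A minimal sketch: the OLS estimator obtained by solving the normal equations
# (X'X) beta_hat = X'Y, using the simulated X and Y from the sketch above.
import numpy as np

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # should be close to the true beta = (2.0, 0.5, -1.2)
```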

Properties of the OLS Estimator

An estimator is a random variable. Its behavior is described by a probability distribution. The probability distribution of an estimator is called a sampling distribution. It describes the behavior of an estimator in a large number of samples. To assess the properties of the OLS estimator, you need to derive the sampling distribution. To derive the sampling distribution, you need to derive the form, mean, and variance.

Form of the Sampling Distribution of the OLS Estimator. The OLS estimator has a multivariate normal sampling distribution. This follows directly from the assumption that the error term has a normal distribution.

Mean of the OLS Estimator

To derive the mean of the OLS estimator of βi, you need to make two assumptions: 1. The error term has mean zero. 2. The error term is uncorrelated with each explanatory variable. If these two assumptions are satisfied, then it can be shown that the mean vector of the OLS estimator is

^ E(i ) = i for i = 1, …, K

That is, the mean of the OLS estimator is equal to the true value of the population parameter being estimated. This tells us that for the classical linear regression model the OLS estimator is an unbiased estimator.
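Unbiasedness can be illustrated with a small Monte Carlo experiment: draw many samples from the same population, compute the OLS estimates in each sample, and check that the estimates average out to the true parameter values. The true parameters, sample size, and number of replications below are hypothetical.

```python
# A minimal Monte Carlo sketch of the sampling distribution of the OLS estimator.
# True parameter values, sample size, and number of replications are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
T, R = 100, 5000                                            # sample size, replications
beta_true = np.array([2.0, 0.5])
X_mc = np.column_stack([np.ones(T), rng.uniform(0, 10, T)])  # held fixed across samples

estimates = np.empty((R, 2))
for r in range(R):
    mu = rng.normal(0.0, 1.5, T)                            # new error draw each sample
    Y_mc = X_mc @ beta_true + mu
    estimates[r] = np.linalg.solve(X_mc.T @ X_mc, X_mc.T @ Y_mc)

print(estimates.mean(axis=0))   # close to (2.0, 0.5): the OLS estimator is unbiased
```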

Variance-Covariance Matrix of Estimates

The variance-covariance matrix of estimates gives the variances and covariances of the sampling distributions of the estimators of the K parameters. To derive the variance-covariance matrix of estimates, you need to make four assumptions: 1. The error term has mean zero. 2. The error term is uncorrelated with each explanatory variable. 3. The error term has constant variance. 4. The errors are uncorrelated.

For the classical linear regression model, it can be shown that the variances of the OLS estimators (the diagonal elements of the variance-covariance matrix) are less than or equal to the corresponding variances for any alternative linear unbiased estimator; therefore, for the classical linear regression model the OLS estimator is an efficient estimator.

Summary of Small Sample Properties

For the classical linear regression model, the OLS estimator is the best linear unbiased estimator (BLUE) of the population parameters.

Summary of Large Sample Properties

For the classical linear regression model, the OLS estimator is a consistent estimator.

Maximum Likelihood Estimation Procedure

When you apply the maximum likelihood estimation procedure to the classical linear regression model you get the maximum likelihood estimator. The maximum likelihood estimation procedure says to choose as estimates of the unknown parameters those values that maximize the likelihood function for the sample of data. The likelihood function is the joint probability function for the sample of T observations. Once you have the sample, the values of Y and the X’s are known, but the values of the β’s are unknown, so the likelihood function is a function of the unknown β’s. Because you choose the values of the β’s that maximize the likelihood function, your sample is more likely to have come from a population with these parameter values than from a population with any other parameter values. For the classical linear regression model, the maximum likelihood estimator is the same as the OLS estimator.
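A sketch of the idea, numerically maximizing the normal log-likelihood for made-up data; for the classical model the coefficient estimates it produces coincide with the OLS estimates.

```python
# A minimal sketch: maximum likelihood estimation of the classical linear regression
# model by numerically maximizing the normal log-likelihood. Data are made up.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T_ml = 200
X_ml = np.column_stack([np.ones(T_ml), rng.uniform(0, 10, T_ml)])
Y_ml = X_ml @ np.array([2.0, 0.5]) + rng.normal(0.0, 1.5, T_ml)

def neg_loglik(params):
    # params = (beta_1, beta_2, log sigma); minimizing the negative log-likelihood
    b, log_sigma = params[:2], params[2]
    sigma2 = np.exp(2 * log_sigma)
    resid = Y_ml - X_ml @ b
    return 0.5 * T_ml * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

result = minimize(neg_loglik, x0=np.zeros(3))
print(result.x[:2])                                       # ML estimates of the betas
print(np.linalg.solve(X_ml.T @ X_ml, X_ml.T @ Y_ml))      # essentially the OLS estimates
```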

Choosing an Estimator for  2

To obtain an estimate of the error variance, the following estimator is the preferred estimator,

RSS 2^ =  T – k

Estimating the Variance-Covariance Matrix of Estimates

The true variance-covariance matrix of estimates is unknown. This is because the true error variance σ² is unknown. Therefore, the variance-covariance matrix of estimates must be estimated using the sample of data. To obtain an estimate of the variance-covariance matrix, you replace σ² with its estimate σ̂² = RSS / (T – K).
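Under the classical assumptions the variance-covariance matrix of the OLS estimator is σ²(X′X)⁻¹, so replacing σ² with σ̂² gives the estimated matrix. A sketch of both steps, where X and Y are a design matrix (with a constant column) and a dependent variable such as those simulated in the earlier sketches:

```python
# A minimal sketch: estimate the error variance and the variance-covariance matrix
# of the OLS estimates, given a design matrix X and dependent variable Y as in the
# earlier simulation sketches.
import numpy as np

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # OLS estimates
T, K = X.shape
residuals = Y - X @ beta_hat
RSS = residuals @ residuals
sigma2_hat = RSS / (T - K)                          # estimated error variance
varcov_hat = sigma2_hat * np.linalg.inv(X.T @ X)    # estimated variance-covariance matrix
std_errors = np.sqrt(np.diag(varcov_hat))           # standard errors of the estimates
print(sigma2_hat, std_errors)
```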

Hypothesis Testing

The following statistical tests can be used to test hypotheses in the classical linear regression model.

1. t-test 2. F-test 3. Likelihood ratio test 4. Wald test 5. Lagrange multiplier test

You must choose the appropriate test to test the hypothesis in which you are interested.
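For example, a t-test of the zero null hypothesis for each coefficient and an F-test of the joint null that all slope coefficients are zero are reported by most regression software. A sketch using statsmodels; the data and true parameter values are made up.

```python
# A minimal sketch: t-tests and an F-test for a classical linear regression model,
# using statsmodels. The data are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
X_h = np.column_stack([rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
Y_h = 2.0 + 0.5 * X_h[:, 0] - 1.2 * X_h[:, 1] + rng.normal(0.0, 1.5, n)

model = sm.OLS(Y_h, sm.add_constant(X_h)).fit()
print(model.tvalues)    # t-statistics for each coefficient (zero null hypothesis)
print(model.pvalues)    # corresponding p-values
print(model.fvalue)     # F-statistic for the joint null that all slopes are zero
print(model.f_pvalue)   # its p-value
```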

Goodness-of-Fit

If your objective is to use the explanatory variable(s) to predict the dependent variable, then you should measure the goodness-of-fit of the model. Goodness-of-fit refers to how well the model fits the sample data. The better the model fits the data, the higher the predictive validity of the model, and therefore the better the values of X should predict values of Y. The statistical measure used most often to measure the goodness-of-fit of a classical linear regression model is the R-squared (R²) statistic.
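R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. A sketch, continuing the fitted statsmodels model from the hypothesis-testing sketch above:

```python
# A minimal sketch: R-squared computed from the residual and total sums of squares,
# using the fitted statsmodels model from the sketch above.
import numpy as np

RSS = np.sum(model.resid ** 2)
TSS = np.sum((model.model.endog - model.model.endog.mean()) ** 2)
print(1 - RSS / TSS)    # computed by hand
print(model.rsquared)   # the same value reported by statsmodels
```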

CONCLUSIONS

If the objective of the study is to analyze the relationship between Y and X1, X2, …, Xk, then in the conclusion you should address 4 questions.

1. Do X1, X2, …, Xk have independent causal effects on Y? 2. What is the direction of each effect? 3. What is the size of each effect? 4. Which variables have the biggest and smallest effects?

Do X1, X2, …, Xk have independent causal effects on Y? There are two approaches you can use to address this question. 1) Hypothesis test approach. 2) Strength of evidence (p-value) approach. When using the hypothesis test approach, you choose a level of significance and test the null hypothesis that a variable has no effect. If you reject the null, you can answer the question with a probability statement. For example, if you choose a 5% level of significance and reject the null hypothesis, then you can be 95% confident that X has an effect on Y. When using the p-value approach, you interpret the p-value for the zero null hypothesis as a measure of the strength of evidence of an effect. The p-value is the probability of observing an effect as large as the one in your sample, or larger, if there were no real effect in the population and the observed effect were the result of random sampling error or chance. The smaller the p-value, the stronger the evidence of an effect. For example, a p-value of less than 0.01 indicates that the evidence for an effect is very strong: if there were no real effect, the probability of observing an effect this large by chance would be less than 1%. These are some general rules-of-thumb for assessing the strength of evidence of an effect for economic data.

Strong evidence: p-value ≤ 0.05
Moderate evidence: 0.05 < p-value ≤ 0.10
Weak evidence: 0.10 < p-value ≤ 0.20
Little or no evidence: p-value > 0.20

It is important to note the following. If you use the hypothesis test approach, choose a level of significance of 0.05, and reject the null, then you can conclude that you are 95% confident that the variable has an effect. However, if you fail to reject the null, then you can conclude only that there is insufficient evidence that the variable has an effect. You cannot say how confident you are that the variable does not have an effect, because you never know the probability of making a type II error (concluding the variable has no effect when it does have an effect).

What is the Direction of the Effect?

To determine the direction of an effect, you use the algebraic sign of the coefficient estimate. If the sign is positive, then X has a positive effect on Y. If the sign is negative, then X has a negative effect on Y.

What is the Size of the Effect?

To assess the size of an effect, you interpret the point or interval estimate of a parameter.

Which Ones Have the Biggest and Smallest Effects?

To compare the size of the effects of variables that have different units of measurement, you need a unit-free measure of effect. In economics, the preferred unit-free measure of effect is elasticity.

Validity of Conclusions

To assess the validity of the conclusions we use two criteria.

1. Internal validity 2. External validity

Internal Validity

Your empirical study is internally valid if the conclusions about the causal effects are valid for the population that you studied.

Criteria for Internal Validity

To assess internal validity, you must address two questions.

1. Are the estimates of the parameters unbiased? 2. Are the estimated standard errors of the parameter estimates unbiased?

If your estimates are biased, then they are systematically too high or too low, and you cannot have much confidence in your conclusions. If your standard errors are biased, then any inferences you draw may be incorrect.

Unbiasedness of Parameter Estimates

The two most important potential sources of bias are omitted confounding variables and reverse causation.

Unbiasedness of Standard Error Estimates

The two most important sources of biased standard errors are heteroscedasticity and autocorrelation.

External Validity

An empirical study is externally valid if the conclusions about the causal effects can be generalized from the population studied to the population of interest; that is, if there are no big differences between the population and setting studied and the population of interest.