
PBAF 528 Week 2

A. More on Simple Regression

From a sample we estimate the following equation:

Yi = b̂0 + b̂1Xi + ei

Ordinary Least Squares (OLS) gives estimates of the model parameters.

⇒ Minimizes the sum of the squared distances between the data points and the fitted line.

⇒ Coefficients weight “outliers” more, since residuals are squared.

⇒ Precision of these estimates (SE) depends on sample size (larger is better), amount of noise (less is better), and amount of variation in the explanatory factor (more is better).

⇒ The fitted line must pass through (X̄, Ȳ).

⇒ b̂1 is the slope of the fitted line.

⇒ b̂0 is the intercept, the expected value of Y when X = 0.
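As a concrete illustration, here is a minimal Python sketch of these formulas. The dataset is hypothetical, chosen so the arithmetic comes out cleanly; it is not from the assignment.

```python
# Hypothetical data, chosen for round numbers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Building blocks for the OLS slope: SSxy and SSxx.
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n

b1 = ss_xy / ss_xx        # slope of the fitted line
b0 = y_bar - b1 * x_bar   # intercept; forces the line through (x_bar, y_bar)

# The fitted line passes through the bivariate mean:
assert abs((b0 + b1 * x_bar) - y_bar) < 1e-9
```

Here b̂1 = 0.6 and b̂0 = 2.2, and the final check confirms the line passes through (X̄, Ȳ).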

PBAF 528 Spring 2005

From A. H. Studenmund, 1997, Using Econometrics: A Practical Guide.

Why does the line pass through (X̄, Ȳ)?

How many lines pass through the bivariate mean?

What is the value of Σ(y − ŷ)? Why?

The least squares model finds the line that makes Σ(y − ŷ)² a minimum.

Variance in Y

Σ(y − ŷ)² is the residual (unexplained) sum of squares (SSE). (SPSS refers to this as the Residual Sum of Squares.)

Σ(ŷ − ȳ)² is the explained sum of squares. (SPSS refers to this as the Regression Sum of Squares.)

Total Sum of Squares = Explained SS + Residual SS

SSyy = (SSyy − SSE) + SSE (your book)

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)², summing over i = 1 to n

SSyy = SSregression + SSE

The OLS model minimizes the sum of squared errors (SSE) and therefore maximizes the explained sum of squares.
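The decomposition above can be verified numerically; a sketch with hypothetical data:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / \
     (sum(a * a for a in x) - sum(x) ** 2 / n)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]  # fitted values

ss_yy = sum((yi - y_bar) ** 2 for yi in y)              # total SS
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)         # explained (regression) SS
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual SS (SSE)

# Total Sum of Squares = Explained SS + Residual SS
assert abs(ss_yy - (ss_reg + sse)) < 1e-9
```

For this dataset SSyy = 6, the regression SS = 3.6, and SSE = 2.4.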

Example 1: Assignment #1 part II 1(a)

B. Assumptions - Straight Line Regression Model

The straight-line regression model assumes four things:

⇒ X and Y are linearly related.

⇒ The only randomness in Y comes from the error term, not from uncertainty about X.

⇒ The errors, e, are normally distributed with mean 0 and variance σ².

⇒ The errors associated with different data points are uncorrelated (not related to each other, independent).

C. Estimator of σ²

σ² measures the variability of the random error, e.

As σ² increases, so does the error in predicting y using ŷ.

s² is an estimate of σ².

s² = SSE/df = SSE/(n − 2)

s is the estimated standard error of the regression model

s measures the spread of the distribution of y values about the least squares line

most observations should be within 2s of the least squares line
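A sketch of computing s² and s, and checking the "within 2s" rule, on hypothetical data:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / \
     (sum(a * a for a in x) - sum(x) ** 2 / n)
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)   # s^2 = SSE / df, with df = n - 2 in simple regression
s = math.sqrt(s2)    # estimated standard error of the regression

# Most observations should be within 2s of the least squares line.
within_2s = sum(abs(yi - (b0 + b1 * xi)) <= 2 * s for xi, yi in zip(x, y))
```

Here s² = 0.8, s ≈ 0.89, and all five points fall within 2s of the fitted line.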

D. Interpreting Results

Goodness of Fit – Assessing the Utility of the Model

1. Coefficient of Determination

The smaller SSE is relative to SSyy, the better the regression line appears to fit (we are explaining more of the variance).

We can measure “fit” by the ratio of the explained sum of squares to the total sum of squares. This ratio is called R².

R² = (SSyy − SSE)/SSyy = 1 − SSE/SSyy = 1 − Σ(yi − ŷi)²/Σ(yi − ȳ)²

In computational form, SSyy = Σy² − (Σy)²/n and SSyy − SSE = b̂1(Σxy − (Σx)(Σy)/n).

⇒ R² must be between 0 and 1.

⇒ The higher the R², the better the fit (the closer the regression equation is to the sample data). R² gives the proportion of variation in Y explained by the whole model.

⇒ R² close to 1 shows an excellent fit.

⇒ R² close to 0 shows that the regression equation isn’t explaining the Y values any better than they’d be explained by assuming no relationship with the X’s.

⇒ OLS (that is, minimizing the squared errors) gives us b’s that minimize the residual sum of squares (keeps the residuals as small as possible) and thus gives the largest R².

⇒ There is no simple method for deciding how high R² has to be for a satisfactory and useful fit.

⇒ You cannot use R² to compare models with different dependent variables and different n.

⇒ R² is just one part of assessing the quality of the fit. Underlying theory, experience, and usefulness all are important.
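A numeric sketch: computing R² both as explained SS over total SS and as 1 − SSE/SSyy, on hypothetical data:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / \
     (sum(a * a for a in x) - sum(x) ** 2 / n)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ss_yy = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = 1 - sse / ss_yy            # 1 - SSE/SSyy
r_squared_alt = (ss_yy - sse) / ss_yy  # explained SS over total SS

assert abs(r_squared - r_squared_alt) < 1e-9
assert 0 <= r_squared <= 1
```

Here R² = 0.6: the model explains 60% of the variation in Y.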

Example 2: Assignment #1 Part III 1(a)

Note: the relationship between the correlation coefficient and the coefficient of determination. For a simple regression, r² (the square of the simple correlation coefficient, r) is equal to R² (the fraction of variability in Y explained by the regression).
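This relationship can be checked directly; a sketch with hypothetical data:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)  # simple correlation coefficient

# R^2 from the regression: 1 - SSE/SSyy, with SSE = SSyy - b1*SSxy
b1 = ss_xy / ss_xx
sse = ss_yy - b1 * ss_xy
r2_from_regression = 1 - sse / ss_yy

assert abs(r ** 2 - r2_from_regression) < 1e-9  # r^2 equals R^2 here
```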

Sampling distribution of parameter estimates

E(b̂1) = b1. The expected value of a coefficient is the true value of that coefficient.

SD(b̂1) = SE(b̂1) = σ_b̂1. The standard deviation of the estimate is the standard error.

Precision of estimates (SE) depends on:

⇒ Randomness in outcomes (σ²) (less is better)
⇒ Size of sample (more is better)
⇒ Variation in the explanatory variable, X (more is better)

Errors are normally distributed (an assumption of the model); therefore, so are the parameter estimates. So, t-tests, confidence intervals, and p-values work for coefficient estimates.

2. Hypothesis test of one coefficient, b1

Step 1: Set up hypothesis about true coefficient

H0: b1 = 0
Ha: b1 ≠ 0

Step 2: Find the test statistic. It tells us how many standard errors away from zero the coefficient is.

t = (b̂1 − b1,H0) / SE(b̂1)

SE is usually obtained from SPSS or Excel.

You can also calculate SE directly:

SE(b̂1) = s / √SSxx = √[(SSyy − b̂1·SSxy)/(n − k − 1)] / √SSxx

where SSyy = Σy² − (Σy)²/n, SSxy = Σxy − (Σx)(Σy)/n, and SSxx = Σx² − (Σx)²/n.
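A sketch of the SE calculation on hypothetical data (k = 1 explanatory variable):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1

ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = ss_xy / ss_xx

s = math.sqrt((ss_yy - b1 * ss_xy) / (n - k - 1))  # regression standard error
se_b1 = s / math.sqrt(ss_xx)                       # standard error of the slope
```

For this dataset s ≈ 0.894 and SE(b̂1) ≈ 0.283.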

Step 3: Find critical value

tα/2 (the critical t for a two-tailed test) can be found in the t table with n − k − 1 degrees of freedom (n = sample size, k = number of explanatory variables).

Step 4: If |t| > tα/2, then reject H0.
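Putting the four steps together in one sketch. The data are hypothetical, and the critical value t(0.025, df = 3) = 3.182 is taken from a standard t table:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1

ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

# Step 1: H0: b1 = 0 vs. Ha: b1 != 0
b1_h0 = 0.0

# Step 2: test statistic
b1 = ss_xy / ss_xx
s = math.sqrt((ss_yy - b1 * ss_xy) / (n - k - 1))
se_b1 = s / math.sqrt(ss_xx)
t_stat = (b1 - b1_h0) / se_b1

# Step 3: critical value for alpha = 0.05, two-tailed, df = n - k - 1 = 3
t_crit = 3.182  # from a t table

# Step 4: decision
reject_h0 = abs(t_stat) > t_crit
```

Here t ≈ 2.121 < 3.182, so we fail to reject H0 at α = 0.05.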

Problems with hypothesis tests

a) Type I Error: Rejecting the null when the null is true. The significance level, α, is the chance of making this type of error; the p-value reports how surprising the estimate would be if the null were true.

b) Type II Error: Failing to reject the null when the null is false. The chance of making this type of error decreases with larger sample sizes, smaller standard errors of the parameter estimate, and larger true parameter values.

Example 3: Assignment #1 Part III 2(a)

3. Confidence Interval for a Parameter Estimate

A (1 − α)·100% confidence interval for the true coefficient (the slope, or b, on some predictor X) is given by:

b̂1 ± t(α/2, df) · SE(b̂1), where we use t at n − k − 1 degrees of freedom.

We can be (1 − α)·100% confident that the true slope is between these values.
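A sketch of the interval calculation on the same kind of hypothetical data, with t(0.025, df = 3) = 3.182 from a t table:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1

ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
ss_yy = sum(b * b for b in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = ss_xy / ss_xx
se_b1 = math.sqrt((ss_yy - b1 * ss_xy) / (n - k - 1)) / math.sqrt(ss_xx)

t_crit = 3.182  # t at alpha/2 = 0.025 with n - k - 1 = 3 degrees of freedom
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
```

The interval is roughly (−0.3, 1.5). Because it contains 0, it agrees with failing to reject H0: b1 = 0 at α = 0.05.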

P-value

The probability that you would get an estimate so far (in SEs) from the null-hypothesis value, bH0, if H0 were true.

· p-values give the level of support for H0: small p-values mean little support for H0

· you can look up t-statistic in t or z table (or use Excel) to find the probability in the tail.

· If you select a = 0.05, then you would reject H0 if p < 0.05
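When only a printed t table is available, you can bracket the p-value rather than compute it exactly; a sketch for df = 3 (the critical values below are assumed from a standard t table, and the t statistic is from a hypothetical regression):

```python
t_stat = 2.121  # slope t statistic from a hypothetical regression, df = 3
crit = {0.20: 1.638, 0.10: 2.353, 0.05: 3.182, 0.01: 5.841}  # two-tailed p -> critical t

# If |t| exceeds the cutoff for p, the p-value is below p; otherwise it is above p.
upper = min(p for p, tc in crit.items() if abs(t_stat) > tc)
lower = max(p for p, tc in crit.items() if abs(t_stat) < tc)
print(lower, "< p <", upper)  # 0.1 < p < 0.2
```

Since p > 0.05, with α = 0.05 we would fail to reject H0.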

Example 4: Assignment #1 Part III 2 (c)

The Research Process: Points to Address in a Research Proposal

In the proposal, make sure you respond to each of these points.

1. Formulate a question/problem

2. Review the literature and develop a theoretical model
⇒ What else has been done on this? What theories address it?

3. Specify the model: independent and dependent variables

4. Hypothesize about the expected signs of the coefficients
⇒ Translate your question into specific hypotheses

5. Collect the data/operationalize
⇒ All variables have the same number of observations
⇒ Unit of analysis (person, month, year, household)
⇒ Degrees of freedom (at least 1 more than the number of parameters; more is better)

6. Analysis
⇒ In the case of regression, estimate and evaluate the equation
⇒ In the proposal, discuss what analyses you will undertake.
