Chapters 2 and 10: Least Squares Regression
Learning goals for this chapter:
- Describe the form, direction, and strength of a scatterplot.
- Use SPSS output to find the following: least-squares regression line, correlation, r², and estimate for σ.
- Interpret a scatterplot, residual plot, and Normal probability plot.
- Calculate the predicted response and residual for a particular x-value.
- Understand that least-squares regression is only appropriate if there is a linear relationship between x and y.
- Determine explanatory and response variables from a story.
- Use SPSS to calculate a prediction interval for a future observation.
- Perform a hypothesis test for the regression slope and for zero population correlation/independence, including: stating the null and alternative hypotheses, obtaining the test statistic and P-value from SPSS, and stating the conclusions in terms of the story.
- Understand that correlation and causation are not the same thing.
- Estimate the correlation for a scatterplot display of data.
- Distinguish between prediction and extrapolation.
- Check for differences between outliers and influential outliers by rerunning the regression.
- Know that scatterplots and regression lines are based on sample data, but hypothesis tests and confidence intervals give you information about the population parameter.

When you have 2 quantitative variables and you want to look at the relationship between them, use a scatterplot. If the scatterplot looks linear, then you can do least-squares regression to get an equation of a line that uses x to explain what happens with y.

The general procedure:
1. Make a scatterplot of the data from the x and y variables. Describe the form, direction, and strength. Look for outliers.
2. Look at the correlation to get a numerical value for the direction and strength.
3. If the data are reasonably linear, get an equation of the line using least-squares regression.
4. Look at the residual plot to see if there are any outliers or the possibility of lurking variables. (Patterns bad, randomness good.)
5. Look at the Normal probability plot to determine whether the residuals are Normally distributed. (The dots sticking close to the 45-degree line is good.)
6. Look at hypothesis tests for the correlation, slope, and intercept. Look at confidence intervals for the slope, intercept, and mean response, and at the prediction intervals.
7. If you had an outlier, re-work the data without the outlier and comment on the differences in your results.

Association
Positive, negative, or no association.
Remember: ASSOCIATION or CORRELATION is NOT the same thing as CAUSATION. (See chapter 3/2.5 notes.)

Response variable: Y
- Also called the dependent variable
- Measures an outcome of a study

Explanatory variable: X
- Also called the independent variable
- Explains or is related to changes in the response variable (p. 105)

Scatterplots:
- Show the relationship between 2 quantitative variables measured on the same individuals.
- Dots only; don't connect them with a line or a curve.
- Form: Linear? Non-linear? No obvious pattern?
- Direction: Positive or negative association? No association?
- Strength: How closely do the points follow a clear form? Strong, moderate, or weak?
- Look for OUTLIERS!

Correlation, r: measures the direction and strength of the linear relationship between 2 quantitative variables. It is computed from the standardized value of each observation with respect to its mean and standard deviation:

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$,

where we have data on variables x and y for n individuals. You won't need to use this formula by hand, but SPSS will.
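If you want to see the arithmetic behind r, here is a minimal Python sketch; it is not part of the course SPSS workflow, and the x and y values are made up purely for illustration.

```python
# Minimal sketch (not part of the SPSS workflow): computing r by the
# standardized-products formula on a small, made-up data set.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # hypothetical response values
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# r = (1 / (n - 1)) * sum of products of standardized x and y values
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 4))   # close to +1 here because the made-up data are nearly linear
```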
Using SPSS to get correlation: Use the Pearson Correlation output. Analyze --> Correlate --> Bivariate (see page 55 in the SPSS manual). The SPSS manual tells you where to find r using the least-squares regression output, but that value is actually the ABSOLUTE VALUE of r, so you need to pay attention to the direction yourself. The Pearson Correlation gives you the actual r with the correct sign.

Properties of correlation:
- X and Y both have to be quantitative.
- It makes no difference which you call X and which you call Y.
- r does not change when you change the units of measurement.
- If r is positive, there is a positive association between X and Y: as X increases, Y increases.
- If r is negative, there is a negative association between X and Y: as X increases, Y decreases.
- −1 ≤ r ≤ 1.
- The closer r is to −1 or to 1, the stronger the linear relationship.
- The closer r is to 0, the weaker the linear relationship.
- Outliers strongly affect r. Use r with caution if outliers are present.

Example: We want to examine whether the amount of rainfall per year increases or decreases corn bushel output. A sample of 10 observations was taken, and the amount of rainfall (in inches) was measured, as was the subsequent growth of corn.

Amount of Rain (in)   Bushels of Corn
3.03                  80
3.47                  84
4.21                  90
4.44                  95
4.95                  97
5.11                  102
5.63                  105
6.34                  112
6.56                  115
6.82                  115

[Scatterplot: corn yield (bushels) versus amount of rain (in).]

a) What does the scatterplot tell us? What is the form? Direction? Strength? What do we expect the correlation to be?

SPSS Correlations output:

                                              amount of rain (in)   corn yield (bushels)
amount of rain (in)     Pearson Correlation   1                     .995(**)
                        Sig. (2-tailed)       .                     .000
                        N                     10                    10
corn yield (bushels)    Pearson Correlation   .995(**)              1
                        Sig. (2-tailed)       .000                  .
                        N                     10                    10
** Correlation is significant at the 0.01 level (2-tailed).

Inference for Correlation:
- R = correlation
- R² = % of variation in Y explained by the regression line (the closer to 100%, the better)
- ρ (Greek letter rho) = correlation for the population

When ρ = 0, there is no linear association in the population, so X and Y are independent (if X and Y are both Normally distributed).

Hypothesis test for correlation: to test the null hypothesis H0: ρ = 0, SPSS will compute the t statistic

$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$,

with degrees of freedom = n − 2 for simple linear regression.

b) Are corn yield and rain independent in the population? Perform a test of significance to determine this.

c) Do corn yield and rain have a positive correlation in the population? Perform a test of significance to determine this.

This test statistic for the correlation is numerically identical to the t statistic used to test H0: β1 = 0.

Can we do better than just a scatterplot and the correlation in describing how x and y are related? What if we want to predict y for other values of x?

Least-Squares Regression fits a straight line through the data points that minimizes the sum of the squared vertical distances of the data points from the line, that is, it minimizes $\sum_{i=1}^{n} e_i^2$.

Equation of the line: $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted y.

Slope of the line: $b_1 = r\,\frac{s_y}{s_x}$, where the slope measures the change in the predicted response variable when the explanatory variable is increased by one unit.

Intercept of the line: $b_0 = \bar{y} - b_1 \bar{x}$, where the intercept is the value of the predicted response variable when the explanatory variable = 0.

Type of line                               equation of line                                slope       y-intercept
Ch. 10 sample (least-squares regression)   $\hat{y} = b_0 + b_1 x$                         $b_1$       $b_0$
Ch. 10 population (model)                  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$   $\beta_1$   $\beta_0$
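The slope and intercept formulas above can be checked against the SPSS output that follows. Here is a minimal Python sketch (again, not part of the course SPSS workflow) applied to the rain and corn-yield data from the example; it uses the standard library statistics module (statistics.correlation requires Python 3.10 or newer).

```python
# Minimal sketch of b1 = r * (s_y / s_x) and b0 = y_bar - b1 * x_bar,
# using the rain / corn-yield data from the example above.
import statistics as st

rain = [3.03, 3.47, 4.21, 4.44, 4.95, 5.11, 5.63, 6.34, 6.56, 6.82]
corn = [80, 84, 90, 95, 97, 102, 105, 112, 115, 115]

r = st.correlation(rain, corn)                  # Pearson correlation, about 0.995
b1 = r * st.stdev(corn) / st.stdev(rain)        # slope
b0 = st.mean(corn) - b1 * st.mean(rain)         # intercept

# Should agree (up to rounding) with the SPSS coefficients: about 50.8 and 9.6.
print(f"corn-hat = {b0:.3f} + {b1:.3f} * rain")
```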
Using the corn example, find the least-squares regression line. Tell SPSS to do Analyze --> Regression --> Linear. Put "rain" into the independent box and "corn" into the dependent box. Click OK.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .995(a)   .991       .989                1.290
a. Predictors: (Constant), amount of rain (in)
b. Dependent Variable: corn yield (bushels)

ANOVA
Model 1       Sum of Squares   df   Mean Square   F         Sig.
Regression    1397.195         1    1397.195      840.070   .000(a)
Residual      13.305           8    1.663
Total         1410.500         9
a. Predictors: (Constant), amount of rain (in)
b. Dependent Variable: corn yield (bushels)

Coefficients
Model 1               Unstandardized B   Std. Error   Standardized Beta   t        Sig.   95% CI for B (Lower, Upper)
(Constant)            50.835             1.728                            29.421   .000   (46.851, 54.819)
amount of rain (in)   9.625              .332         .995                28.984   .000   (8.859, 10.391)
a. Dependent Variable: corn yield (bushels)

d) What is the least-squares regression line equation?

The scatterplot with the least-squares regression line looks like:
[Scatterplot: corn yield (bushels) versus amount of rain (in), with fitted line; Rsq = 0.9906.]
R² is the percent of variation in corn yield explained by the regression line with rain = 99.06%.

Hypothesis testing for H0: β1 = 0

Test statistic: $t = \frac{b_1}{SE_{b_1}}$ with df = n − 2.

SPSS will give you the test statistic (under t) and the 2-sided P-value (under Sig.).

e) Is the slope positive in the population? Perform a test of significance.

f) What % of the variability in corn yield is explained by the least-squares regression line?

g) What is the estimate of the standard error of the model?

What do we mean by prediction or extrapolation? Use your least-squares regression line to find y for other x-values.

Prediction: using the line to find y-values corresponding to x-values that are within the range of your data x-values.

Extrapolation: using the line to find y-values corresponding to x-values that are outside the range of your data x-values. Be careful about extrapolating y-values for x-values that are far away from the x data you currently have.
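As an illustration only, here is a short Python sketch that plugs two rain values into the fitted corn line from the SPSS output above; the function name and the rain values chosen are hypothetical, and the second value is deliberately far outside the observed data range to show why extrapolation is risky.

```python
# Minimal sketch of prediction vs. extrapolation with the fitted line
# corn-hat = 50.835 + 9.625 * rain (coefficients from the SPSS output above).
def predicted_corn(rain_inches: float) -> float:
    return 50.835 + 9.625 * rain_inches

# Prediction: 5.0 in of rain lies inside the observed range (3.03 to 6.82 in).
print(predicted_corn(5.0))    # about 98.96 bushels

# Extrapolation: 15 in of rain lies far outside the observed range, so the
# linear pattern may not hold there and this estimate should not be trusted.
print(predicted_corn(15.0))   # about 195.2 bushels; treat with caution
```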