Correlation and Regression s1


Unit 8 Chapter 9 Correlation and Regression

Scatter Diagram and Linear Correlation

A scatter diagram is a graph in which data pairs (x, y) are plotted as individual points on a grid with horizontal axis x and vertical axis y. The x variable is called the explanatory variable, and the y variable is called the response variable.

A scatter diagram lets us see whether there may be a linear relationship between the x and y values. Correlation gives us tools to determine whether a linear relationship exists and, if it does, how strong it is.

A veterinary science study was conducted to study the weight of Shetland ponies. The question posed was "How much should a healthy Shetland pony weigh?" The following data were observed and then expanded to calculate a correlation for the situation and to construct a line of best fit for the data.

Weight of Shetland Ponies

x = age of the pony (in months)
y = average weight of the pony (in kilograms)

|        | x  | y   | x^2  | y^2   | xy   |
|--------|----|-----|------|-------|------|
|        | 3  | 60  | 9    | 3600  | 180  |
|        | 6  | 95  | 36   | 9025  | 570  |
|        | 12 | 140 | 144  | 19600 | 1680 |
|        | 18 | 170 | 324  | 28900 | 3060 |
|        | 24 | 185 | 576  | 34225 | 4440 |
| Totals | 63 | 650 | 1089 | 95350 | 9930 |

n = 5

A scatter diagram shows the observed points. The points follow a close-to-linear pattern, with y increasing as x increases.

[Figure: scatter diagram "Ages & Average Weights of Shetland Ponies". Horizontal axis: Age (months). Vertical axis: Average Weight (kilograms). Plotted points: (3, 60), (6, 95), (12, 140), (18, 170), (24, 185).]

The sample correlation coefficient r measures the strength of the linear association between the two variables.

1) The calculated r is always between -1 and 1.
2) If r = -1, there is a perfect negative correlation, which means that as the x variable increases, the y variable decreases.
3) If r = 1, there is a perfect positive correlation, which means that as the x variable increases, the y variable increases.
4) If r = 0, there is no linear correlation.
5) The closer r is to -1 or 1, the stronger the linear relationship.

Correlation Coefficient

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

Use Excel to construct a table to calculate these totals, where x = age of the pony (in months) and y = average weight of the pony (in kilograms). The table of x, y, x^2, y^2, and xy values given with the pony data supplies the totals:

n = 5, $\sum xy = 9930$, $\sum x = 63$, $\sum y = 650$, $\sum x^2 = 1089$, $\sum y^2 = 95350$

$$r = \frac{5(9930) - (63)(650)}{\sqrt{5(1089) - 63^2}\,\sqrt{5(95350) - 650^2}} = \frac{8700}{8948.35} \approx 0.972$$

Since r = 0.972 is close to 1, there is a very high positive linear correlation.
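The calculation of r can also be checked outside of Excel. The following is a minimal Python sketch, assuming the five pony data pairs from the table; the function name `correlation` and the variable names are illustrative choices, not part of the original notes.

```python
import math

# Pony data from the table: age in months (x), average weight in kg (y).
ages = [3, 6, 12, 18, 24]
weights = [60, 95, 140, 170, 185]


def correlation(x, y):
    """Sample correlation coefficient r from the computational formula."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(ages, weights)
print(round(r, 3))  # 0.972
```

The function builds the same six quantities as the Excel table (n and the five totals) before applying the formula, so each intermediate value can be compared with the worksheet.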

Strength of Correlation

| Size of r    | Interpretation |
|--------------|----------------|
| 0.90 to 1.00 | very high      |
| 0.70 to 0.89 | high           |
| 0.50 to 0.69 | moderate       |
| 0.30 to 0.49 | low            |
| 0.00 to 0.29 | little, if any |

Note: These values can be positive or negative. Only the positive values are shown.
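The interpretation scale above can be turned into a small lookup. This is a sketch only: the function name `describe_strength` is hypothetical, and it assumes that a value falling between two listed bands (for example 0.895) belongs to the lower band, which the notes do not state.

```python
def describe_strength(r):
    """Return the interpretation label for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # the scale applies to the magnitude of r, ignoring sign
    if size >= 0.90:
        return "very high"
    if size >= 0.70:
        return "high"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "low"
    return "little, if any"


print(describe_strength(0.972))  # very high
print(describe_strength(-0.55))  # moderate
```

The sign of r still matters for the direction of the relationship; the label describes only its strength.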

Linear Regression and the Coefficient of Determination

The scatter diagram below has a least-squares line overlaid on the grid. Excel's Trendline option produces this line, but you should use the formula given to calculate the equation of the line yourself.

[Figure: scatter diagram "Ages & Average Weights of Shetland Ponies" with the Excel trendline y = 5.89x + 55.73. Horizontal axis: Age (months). Vertical axis: Average Weight (kilograms). Plotted points: (3, 60), (6, 95), (12, 140), (18, 170), (24, 185).]

Least-squares line: $\hat{y} = a + bx$, where a is the intercept and b is the slope. The symbol $\hat{y}$ is pronounced "y-hat".

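Using the Excel sheet for the values:

First find the sample mean for x and the sample mean for y:

$$\bar{x} = \frac{\sum x}{n} = \frac{63}{5} = 12.6 \qquad \bar{y} = \frac{\sum y}{n} = \frac{650}{5} = 130$$

Slope:

$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(9930) - (63)(650)}{5(1089) - 63^2} = \frac{8700}{1476} \approx 5.89$$

Intercept:

$$a = \bar{y} - b\,\bar{x} = 130 - 5.89(12.6) \approx 55.79$$

Therefore the regression line is $\hat{y} = 55.79 + 5.89x$. (The intercept in the Excel trendline, 55.73, differs slightly because Excel does not round the slope to 5.89 before computing the intercept.)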
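The slope and intercept can be reproduced with a few lines of Python. This is a minimal sketch that assumes the pony data from the table; `least_squares` is an illustrative name. It also shows where the small difference between the hand calculation and the Excel trendline comes from.

```python
# Pony data from the table: age in months (x), average weight in kg (y).
ages = [3, 6, 12, 18, 24]
weights = [60, 95, 140, 170, 185]


def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_xy = sum(p * q for p, q in zip(x, y))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y-bar - b * x-bar
    return a, b


a, b = least_squares(ages, weights)
print(round(b, 2), round(a, 2))  # 5.89 55.73

# Rounding the slope first, as in the hand calculation, shifts the intercept.
a_hand = 130 - round(b, 2) * 12.6
print(round(a_hand, 2))  # 55.79
```

Carrying the unrounded slope through gives the intercept shown on the trendline; rounding b to 5.89 before computing a gives 55.79.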

Using the least-squares line for prediction: Making predictions is the main application of linear regression. The least-squares line can be used to predict $\hat{y}$ values for corresponding x values. There are two types of predictions.

1) Interpolation: Predicting $\hat{y}$ for x values that lie between observed x values in the data set. For example, find $\hat{y}$ for a 10-month-old pony: $\hat{y}$ = 55.79 + 5.89(10) = 114.69 kg.

2) Extrapolation: Predicting $\hat{y}$ for x values that lie beyond the observed x values in the data set. Extrapolating too far beyond the observed x values may give unreasonable results. For example, find $\hat{y}$ for a 30-month-old pony: $\hat{y}$ = 55.79 + 5.89(30) = 232.49 kg.
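The two kinds of prediction can be told apart by checking whether x falls inside the observed range of ages (3 to 24 months). The sketch below assumes the hand-calculated line with intercept 55.79 and slope 5.89, with x in months and the result in kilograms as in the data table; the names `predict_weight` and `OBSERVED_AGES` are hypothetical.

```python
# Observed ages (months) and the hand-calculated regression coefficients.
OBSERVED_AGES = [3, 6, 12, 18, 24]
INTERCEPT = 55.79  # a, in kilograms
SLOPE = 5.89       # b, in kilograms per month


def predict_weight(age_months):
    """Return (predicted weight in kg, kind of prediction)."""
    y_hat = round(INTERCEPT + SLOPE * age_months, 2)
    inside = min(OBSERVED_AGES) <= age_months <= max(OBSERVED_AGES)
    kind = "interpolation" if inside else "extrapolation"
    return y_hat, kind


print(predict_weight(10))  # (114.69, 'interpolation')
print(predict_weight(30))  # (232.49, 'extrapolation')
```

The label is only a warning flag: the line produces a number for any x, but nothing in the data supports the prediction once x leaves the observed range.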

Coefficient of Determination

The coefficient of determination r^2 is formed by squaring the correlation coefficient r. Here r ≈ 0.972, so r^2 ≈ 0.945.

The coefficient of determination measures the proportion of the variation in y that is explained by the regression line, using x as the explanatory variable.

With r^2 ≈ 0.945, about 94.5% of the variation in y can be explained by x when we use the regression line. The remaining 5.5% of the variation is due to random chance or possibly a lurking variable.
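The "proportion of variation explained" reading can be checked directly by comparing the variation left over around the regression line with the total variation in y. The following sketch assumes the pony data from the table and fits the line with unrounded coefficients; the variable names are illustrative.

```python
# Pony data from the table: age in months (x), average weight in kg (y).
ages = [3, 6, 12, 18, 24]
weights = [60, 95, 140, 170, 185]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(weights) / n

# Least-squares slope and intercept (unrounded).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, weights))
s_xx = sum((x - mean_x) ** 2 for x in ages)
b = s_xy / s_xx
a = mean_y - b * mean_x

# Total variation in y, and variation left unexplained by the line.
predicted = [a + b * x for x in ages]
ss_total = sum((y - mean_y) ** 2 for y in weights)
ss_residual = sum((y - p) ** 2 for y, p in zip(weights, predicted))

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))  # 0.945
```

For a least-squares line with one explanatory variable, this ratio agrees with the square of the correlation coefficient, which is why squaring r gives the same 0.945.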
