Scatter Correlation Classifications

• Scatter diagrams are used to demonstrate correlation between two • Correlation can be 19 quantitative variables. classified into three basic Regression 18 categories 19 • Often, this correlation is 17 18 • Linear 17 linear . 16 16

15 15 Weight • Nonlinear • This that a straight Weight line model can be 14 • No correlation 14 13 13 developed. 12 12 20 21 22 23 24 Length 20 21 22 23 24 Length

Chapter 5 # 1 Chapter 5 # 2

Correlation Classifications Correlation Classifications

• Two variables may be

correlated but not through Curvilinear Correlation

a . 500 50 40 400 • Two quantitative • This type of model is 30 300 variables may not be 20

result 10 called non-linear 200

correlated at all C4 0 • The model might be one 100 -10 -20 0 -30 of a curve. 0 5 10 15 -40 sample 0 5 10 15 20 25 Animalno

Chapter 5 # 3 Chapter 5 # 4 Linear Correlation Linear Correlation

• Variables that are correlated through a Regression Plot 19 linear relationship can • Negatively correlated Regression Plot 18 variables vary as display either positive 17 opposites 4 or negative correlation 16

15 • Positively correlated Weight • As the value of one 14 increases the variables vary directly. 3 13 other decreases 12 Student GPA 20 21 22 23 24 Length

2

0 10 20 30 40 Hours Worked

Chapter 5 # 5 Chapter 5 # 6

Strength of Correlation Strength of Correlation

• Correlation may be strong, • When the is moderate, or weak. Regression Plot distributed quite close Regression Plot to the line the 95 • You can estimate the 90 4 strength be observing the correlation is said to 85 be strong 80 variation of the points 75

70

around the line 3 • The correlation type is 65 independent of the 60 Final Exam Score • Large variation is weak Student GPA strength. 55 correlation 50

2 55 65 75 85 95 0 10 20 30 40 Midterm Stats Grade Hours Worked

Chapter 5 # 7 Chapter 5 # 8 The Interpreting r

• The sign of the correlation coefficient tells us the • The strength of a linear relationship is measured direction of the linear relationship by the correlation coefficient  If r is negative (<0) the correlation is negative. The • The sample correlation coefficient is given the line slopes down symbol “r”  If r is positive (> 0) the correlation is positive. The • The population correlation coefficient has the line slopes up symbol “ρρρ”.

Chapter 5 # 9 Chapter 5 # 10

Interpreting r Cautions

• The size (magnitude) of the correlation • The correlation coefficient only gives us an coefficient tells us the strength of a linear indication about the strength of a linear relationship relationship .  If | r | > 0.90 implies a strong linear association • Two variables may have a strong curvilinear  For 0.65 < | r | < 0.90 implies a moderate linear relationship, but they could have a “weak” value association for r  For | r | < 0.65 this is a weak linear association

Chapter 5 # 11 Chapter 5 # 12 Fundamental Rule of Correlation Setting

• Correlation DOES NOT imply causation • A chemical engineer would like to determine if a – Just because two variables are highly correlated does relationship exists between the extrusion not that the explanatory variable “causes ” the temperature and the strength of a certain response formulation of plastic. She oversees the production of 15 batches of plastic at various • Recall the discussion about the correlation temperatures and records the strength results. between sexual assaults and ice cream cone sales

Chapter 5 # 13 Chapter 5 # 14

The Study Variables The Experimental Data

• The two variables of interest in this study are the strength of the plastic and the extrusion temperature. • The independent variable is extrusion temp. This is the Temp 120 125 130 135 140 variable over which the experimenter has control. She Str 18 22 28 31 36 can set this at whatever level she sees as appropriate. • The response variable is strength. The value of “strength” Temp 145 150 155 160 165 is thought to be “dependent on” temperature. Str 40 47 50 52 58

Chapter 5 # 15 Chapter 5 # 16 The Scatter Plot Conclusions by Inspection

• The scatter for • Does there appear to be a relationship between the the temperature versus Scatter diagram of Strength vs Temperature 60 study variables? strength data allows us to deduce the nature of the 50 • Classify the relationship as: Linear, curvilinear, no relationship between these 40 relationship Strength (psi) two variables 30 • Classify the correlation as positive, negative, or no

20 correlation

120 130 140 150 160 170 Temperature (F) • Classify the strength of the correlation as strong, moderate, weak, or none

What can we conclude simply from the scatter diagram?

Chapter 5 # 17 Chapter 5 # 18

Computing r Computing r

1  x − x  y − y  1 r = Σ   r = Σ[]()Z ()Z −    − x y n 1  sx  sy  n 1

z-scores df z-scores for y data for x data

Chapter 5 # 19 Chapter 5 # 20 Computing r - Example Classifying the strength of linear correlation

•The strength of a linear correlation between the response See example handout for the plastic strength versus and the explanatory variable can be assigned based on r extrusion temperature setting

These classifications are discipline dependent

Chapter 5 # 21 Chapter 5 # 22

Scatter Diagrams and Statistical Classifying the strength of linear Modeling and Regression correlation

For this class the following criteria are adopted: • We’ve already seen that the best graphic for illustrating the relation between two quantitative If |r| > 0.90 then the correlation is strong variables is a scatter diagram. We’d like to take If |r| < 0.65 then the correlation is weak this concept a step farther and, actually develop a mathematical model for the relationship between two quantitative variables If 0.65 < |r| < 0.90 then the correlation is moderate

Chapter 5 # 23 Chapter 5 # 24 Using the Line of Best Fit to Make The Line of Best Fit Plot Predictions

• Since the data appears to be linearly related we can find a straight line model 60 • Based on this , what is the 60 that fits the data better 50 predicted strength for 50 than all other possible 40 plastic that has been 40 straight line models. Strength 30 extruded at 142 degrees? Strength • This is the Line of Best 30 20 Fit (LOBF) 20 120 130 140 150 160 170 Temp 120 130 140 150 160 170 Temp

Chapter 5 # 25 Chapter 5 # 26

Using the Line of Best Fit to Make Using the Line of Best Fit to Make Predictions Predictions

• Given a value for the predictor variable, determine • Based on this graphical the corresponding value of model, at what 60 the dependent variable temperature would I need 50 graphically. to extrude the plastic in • Based on this model we 40 order to achieve a strength Strength would predict a strength of 30 appx. 39 psi for plastic of 45 psi? extruded at 142 F 20

120 130 140 150 160 170 Temp

Chapter 5 # 27 Chapter 5 # 28 Using the Line of Best Fit to Make Predictions Computing the LSR model

• Locate 45 on the response axis (y-axis) • Given a LSR line for , we can use that • Draw a horizontal line to line to make predictions. the LOBF • Drop a vertical line down • How do we come up with the best linear model to the independent axis from all possible models? • The intercepted value is the temp. required to achieve a strength of 45 psi

Chapter 5 # 29 Chapter 5 # 30

Bivariate data and the sample linear The straight line model regression model

• For example, look at • Any straight line is completely defined by two the fitted line plot of : powerboat registrations and the  The slope – steepness either positive or negative number of manatees  The y-intercept – this is where the graph crosses the killed. vertical axis • It appears that a linear model would be a good one. = + yˆ b o b1 x Chapter 5 # 31 Chapter 5 # 32 The Estimators Calculating the Parameter Estimators

• The equation for the LOBF is: • In our model “b0” is the estimator for the intercept. The true value for this parameter is β 0 = + yˆ bo b1x • “b1” estimates the slope. The true value for this β parameter is 1

Chapter 5 # 33 Chapter 5 # 34

Calculating the Parameter Estimators Computing the Intercept Estimator

• To get the slope estimator we use: • The intercept estimator is computed from the variable means and the slope: n Σ (x y ) − Σ x ⋅ Σ y b = 1 2 2 = − n Σ ()x − ()Σ x b0 y b1x or • Realize that both the slope and intercept  s  =  y  estimated in these last two slides are really point b1 r   estimates for the true slope and y-intercept  s x  Chapter 5 # 35 Chapter 5 # 36 Revisit the manatee example Computing the estimators

Look at the summary and correlation coefficient data from the manatee example So the slope is:

Variable N Mean SEMean StDev  s  =  y  Boats 10 74.10 2.06 6.51 b1 r   s Deaths 10 55.80 5.08 16.05  x  = 16 05.  = Minitab correlation coefficient output b1 .0 921   27.2 Correlations: Boats, Deaths  51.6  Pearson correlation of Boats and Deaths = 0.921 P-Value = 0.000

Chapter 5 # 37 Chapter 5 # 38

Computing the estimators Put it together

• In general terms any old And the intercept is calculated using the slope equation is: along with the variable means: response = intercept + slope(predictor)

b = y − b x 0 1 • Specifically for the manatee example the sample = 55 8. − 27.2 ()74 1. regression equation is: = −112 4. Deaths = -112.7 + 2.27(boats)

Chapter 5 # 39 Chapter 5 # 40 The slope estimate The slope estimate

• In our example the estimated slope is 2.27 • b1 is the estimated slope of the line • The interpretation of the slope is, “The amount of change in the response for every one unit change in the • This is interpreted as, “For each additional 10,000 boats independent variable.” registered, an additional 2.27 more manatees are killed

Chapter 5 # 41 Chapter 5 # 42

The intercept estimate The intercept Estimate

• Recall the sample regression model: • Sometimes this value is meaningful. For example resting metabolic rate versus ambient temperature in “b ” is the estimated y- intercept 0 Centigrade ( oC) yˆ = b + b x • Sometimes it’s not meaningful at all. 0 1 • This is an example where the y-intercept just serves to The interpretation of the y-intercept is, “The make the model fit better. There can be no such thing as a –112.7 manatees killed value of the response when the control (or independent) variable has a value of 0.”

Chapter 5 # 43 Chapter 5 # 44 Regression Output Regression Output

• The sample regression equation is: Use the minitab regression output for the ManateesKilled = -112.7 + 2.27(boats) manatee example to predict the expected number of manatees killed when the number of • So: power boat registrations is 750,000 (x = 75) ManateesKilled = -112.7 + 2.27(75) = 57.6

• This means that we expect between 57 and 58 manatees killed in a year where 750,000 power boats are registered.

Chapter 5 # 45 Chapter 5 # 46

Regression Output Regression Output

• The sample regression equation is: Use the minitab regression output for the ManateesKilled = -112.7 + 2.27(boats) manatee example to predict the expected number of manatees killed when the number of • So: power boat registrations is 850,000 (x = 85) ManateesKilled = -112.7 + 2.27(85) = 80.25

• This means that we expect between 80 and 81 manatees killed in a year where 750,000 power boats are registered.

Chapter 5 # 47 Chapter 5 # 48 Regression Output Cardinal Rule of Regression

• NEVER NEVER NEVER NEVER NEVER NEVER predict a response value from a predictor value that is STOP!! YOU HAVE VIOLATED outside of the experimental . THE CARDINAL RULE OF REGRESSION • The only predictions we can make (statistically) are predictions for responses where powerboat registrations are between 670,000 and 840,000. • This means that our prediction for the year when 850,000 powerboats were registered is garbage

Chapter 5 # 49 Chapter 5 # 50

Regression Estimates The coefficient of determination The r 2 Value • r2 is called the coefficient of determination . • r2 is a proportion, so it is a number between 0 and 1 • If r 2 is, say, 0.857 we can conclude that 85.7% of the inclusive. variability in the response is explained by the • r2 quantifies the amount of variation in the response that is variability in the independent variable. due to the variability in the predictor variable. • This leaves 100 - 85.7 = 14.3% left unexplained. It’s • r2 values close to 0 mean that our estimated model is a only the unexplained variation that is incorporated into poor one while values close to 1 imply that our model the “uncertainty” does a great job explaining the variation

Chapter 5 # 51 Chapter 5 # 52 Scatter of Points and r 2 r2 and the correlation coefficient

• r2 is related to the correlation coefficient • It’s just the square of r • The interpretation as the proportion of variation in the response that is explained by the variation in the predictor variable makes it an important

r2 = 0.848 r2 = 0.992

Chapter 5 # 53 Chapter 5 # 54