1 Correlation, Scatterplots, and Regression Lines

The sample correlation coefficient r is a measure of the linear relationship between two variables X and Y:

\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\]

where

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\]

are the means of the x's and y's, and

\[
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

are the sample standard deviations of the x's and y's.

Quick facts about correlation:

• Correlation is a measure of the linear (straight-line) relationship between two variables. There can be a strong (or even perfect) nonlinear (parabolic, circular, ...) relationship between two variables while the correlation is zero, indicating no linear relationship.
• −1 ≤ r ≤ 1.
• If r = +1, we say X and Y are perfectly positively correlated. The linear relationship is perfect. In this case, y = ax + b with a > 0. That is, if all the ordered pairs (x, y) were plotted, they would fall exactly on a line with positive slope.
• If r = −1, we say X and Y are perfectly negatively correlated. The linear relationship is perfect. In this case, y = ax + b with a < 0. That is, if all the ordered pairs (x, y) were plotted, they would fall exactly on a line with negative slope.
• If r = 0, we say X and Y are uncorrelated. There is no linear relationship between the variables.
• The magnitude (absolute value, distance from zero) of the correlation coefficient indicates the strength of the linear relationship between X and Y.
• The sign (positive or negative) of the correlation coefficient indicates the direction (positive or negative) of the linear relationship between X and Y.
• CORRELATION IS NOT CAUSATION!!! Just because X and Y are correlated, it does not mean that one causes the other.
• If r > 0, when x is above its mean, y tends to be above its mean (when x is below its mean, y tends to be below its mean). That is, they tend to move in the same direction relative to their means.
• If r < 0, when x is above its mean, y tends to be below its mean (when x is below its mean, y tends to be above its mean). That is, they tend to move in the opposite direction relative to their means.
• The correlation coefficient (like the mean and standard deviation) can be greatly affected by outliers.

[Figure: Scatterplots of X and Y from r = −1 to r = 1 in steps of 0.1.]

A regression line describes how a response variable y changes as an explanatory variable x changes. If the regression line is y = a + bx, the intercept a is the value of y when x = 0. The slope b measures the change in y for a one-unit increase in x. Once a and b are known, to predict a value of y given an x, simply plug the x-value into the equation and solve for y. However, prediction outside the range of data used to estimate the line should be avoided. If the line was generated for x in the interval [10, 20], for example, predictions for x < 10 or x > 20 may not be valid.

Example: Suppose the annual sales (y) of a company are related to annual advertising expenditures (x) and the linear regression line is y = 10000 + 5x, where x is measured in dollars. What is the predicted annual sales if x = 10000?

\[
y = 10000 + 5x = 10000 + 5(10000) = 60{,}000.
\]
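The prediction step and the caution about extrapolation can be sketched in a few lines of Python. This is only an illustration: the function name `predict_sales` is hypothetical, and the fitted range of 5000 to 20000 dollars is assumed here for the sake of the example.

```python
def predict_sales(x, intercept=10000.0, slope=5.0, fitted_range=(5000.0, 20000.0)):
    """Evaluate y = intercept + slope * x and flag extrapolation.

    fitted_range is an assumed interval of x-values used to estimate the line;
    predictions for x outside it may not be valid.
    """
    low, high = fitted_range
    y_hat = intercept + slope * x
    extrapolated = not (low <= x <= high)
    return y_hat, extrapolated


# x = 10000 lies inside the assumed range, so the prediction is not flagged.
inside = predict_sales(10000)

# x = 0 lies outside the assumed range, so the prediction is flagged.
outside = predict_sales(0)
```

Returning a flag instead of refusing to predict leaves the decision to the reader of the result, which matches the notes' wording that such predictions "may not be valid".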
If the range of annual advertising used to generate this line was (5000, 20000), the predicted value of sales without advertising, y = 10000 + 5(0) = 10000, may not be valid.

The slope coefficient b of a regression of y on x is related to the correlation coefficient r via the formula

\[
b = r \, \frac{s_y}{s_x}
\]

where s_y is the standard deviation of y and s_x is the standard deviation of x. The signs of r and b will be the same (either both positive or both negative).

The coefficient of determination, r², indicates the proportion of the variability of Y that can be 'explained' by the variability of X. For example, if we perform a linear regression of Y on the explanatory variable X and find that r² = 0.60, then 60% of the variation in Y (about its mean) can be 'explained' by variation in X (about its mean).

Quick facts about r²:

• r² is the square of the correlation coefficient of X and Y.
• 0 ≤ r² ≤ 1.
• If r² = 0, there is no linear relationship between X and Y.
• If r² = 1, there is a perfect linear relationship between X and Y. Without additional information, we cannot tell if this relationship is positive or negative, since (−1)² = 1 = (+1)².

2 Price Indices

The consumer price index (CPI) is a measure of the general level of prices, and its change over time is a measure of inflation. In general, the value of a dollar declines over time. In other terms, to buy a fixed basket of goods (say, the typical goods and services an urban household purchases) it takes more dollars today than it did in the past. The current (nominal) price of a good is the cost in today's dollars. The real (inflation-adjusted) price of a good takes into account that the general level of prices has changed. To convert an amount in dollars at time A into the equivalent amount in dollars at time B, using CPI as our indicator of the general level of prices, use the formula

\[
\text{dollars at time } B = \frac{\text{dollars at time } A}{\text{CPI at time } A} \times \text{CPI at time } B.
\]

Example: Suppose the CPI for 2000 to 2006 and the prices of a given statistics text are as follows:

Year   2000  2001  2002  2003  2004  2005  2006
CPI     100   103   107   112   115   119   124
Text     74    82    88    96   105   118   134

1. In terms of 2002 dollars, what is the textbook price in 2006? Let B be year 2002 and A be 2006; then the 2006 price in 2002 dollars is

\[
\text{dollars at time } B = \frac{\text{dollars at time } A}{\text{CPI at time } A} \times \text{CPI at time } B = \frac{134}{124} \times 107 = 115.63.
\]

2. In terms of 2006 dollars, what is the textbook price in 2002? Let A be year 2002 and B be 2006; then the 2002 price in 2006 dollars is

\[
\text{dollars at time } B = \frac{\text{dollars at time } A}{\text{CPI at time } A} \times \text{CPI at time } B = \frac{88}{107} \times 124 = 101.98.
\]

3. How much did the price of the textbook increase from 2000 to 2006:

(a) In nominal terms (i.e., stated price)?

\[
\frac{134 - 74}{74} = 0.810811 \approx 81\%.
\]

(b) In real terms (i.e., adjusted for the general level of inflation)? Let A be year 2000 and B be 2006; then the real increase in the textbook price was

\[
\frac{\dfrac{\text{dollars at time } B}{\text{CPI at time } B} - \dfrac{\text{dollars at time } A}{\text{CPI at time } A}}{\dfrac{\text{dollars at time } A}{\text{CPI at time } A}} = \frac{\dfrac{134}{124} - \dfrac{74}{100}}{\dfrac{74}{100}} = 0.460331 \approx 46\%
\]

or, equivalently,

\[
\frac{\dfrac{\text{dollars at time } B}{\text{CPI at time } B}}{\dfrac{\text{dollars at time } A}{\text{CPI at time } A}} - 1 = \frac{\dfrac{134}{124}}{\dfrac{74}{100}} - 1 = 0.460331 \approx 46\%.
\]

3 Probability

• We denote by S the collection of all possible outcomes.
• An event is a collection of outcomes.
• If A is an event, the probability that A occurs is denoted by P(A).
• 0 ≤ P(A) ≤ 1: probabilities of events are between 0 and 1, inclusive.
• P(∅) = 0, where ∅ is the impossible event (an event that cannot happen).
• P(S) = 1, where S is the sure event (the event that one of the possible outcomes occurs). For example, toss a coin: the probability that it comes up either heads or tails is one, since one of the possible outcomes must occur.
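The facts above can be checked on the coin-toss example with a small Python sketch. It assumes a fair coin, so that all outcomes are equally likely and P(A) is the number of outcomes in A divided by the number of outcomes in S; that assumption, and the helper name `prob`, are illustrative and not part of the notes.

```python
from fractions import Fraction

# Sample space S for one toss of a coin (assumed fair: outcomes equally likely).
S = frozenset({"heads", "tails"})


def prob(event, sample_space=S):
    """P(event) under the equally-likely assumption: |event| / |sample_space|."""
    event = frozenset(event)
    if not event <= sample_space:
        raise ValueError("an event must be a collection of outcomes from S")
    return Fraction(len(event), len(sample_space))


p_impossible = prob(set())    # the impossible event (empty set)
p_heads = prob({"heads"})     # an ordinary event
p_sure = prob(S)              # the sure event: heads or tails
```

Using `Fraction` keeps the probabilities exact, so comparisons with 0 and 1 do not depend on floating-point rounding.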
