1 Correlation, Scatterplots, and Regression Lines

The correlation coefficient r is a measure of the linear relationship between two variables X and Y .

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
are the sample means of the x's and y's, and

$$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}$$
are the sample standard deviations of the x's and y's.

Quick facts about correlation:

• Correlation is a measure of the linear (straight-line) relationship between two variables. There can be a strong (or even perfect) nonlinear (parabolic, circular, ...) relationship between two variables while the correlation is zero, indicating no linear relationship.
• −1 ≤ r ≤ 1.
• If r = +1, we say X and Y are perfectly positively correlated. The linear relationship is perfect. In this case, y = ax + b with a > 0. That is, if all the ordered pairs (x, y) were plotted, they would fall exactly on a line with positive slope.
• If r = −1, we say X and Y are perfectly negatively correlated. The linear relationship is perfect. In this case, y = ax + b with a < 0. That is, if all the ordered pairs (x, y) were plotted, they would fall exactly on a line with negative slope.
• If r = 0, we say X and Y are uncorrelated. There is no linear relationship between the variables.
• The magnitude (absolute value, distance from zero) of the correlation coefficient indicates the strength of the linear relationship between X and Y.
• The sign (positive or negative) of the correlation coefficient indicates the direction (positive or negative) of the linear relationship between X and Y.
• CORRELATION IS NOT CAUSATION! Just because X and Y are correlated, it does not mean that one causes the other.
• If r > 0, when x is above its mean, y tends to be above its mean (when x is below its mean, y tends to be below its mean). That is, they tend to move in the same direction relative to their means.
• If r < 0, when x is above its mean, y tends to be below its mean (when x is below its mean, y tends to be above its mean). That is, they tend to move in opposite directions relative to their means.
• The correlation coefficient (like the mean and standard deviation) can be greatly affected by outliers.
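The formula for r and two of the quick facts (perfect lines give r = ±1, and a perfect nonlinear relationship can give r = 0) can be checked numerically. The following is a minimal sketch that implements the definitions above directly; the function names and the small data sets are made up for illustration.

```python
import math

def mean(values):
    """Sample mean: (1/n) * sum of the values."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n-1) * sum of the products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data chosen only to illustrate the quick facts.
xs = [1, 2, 3, 4, 5]
line_up = [2 * x + 1 for x in xs]       # y = ax + b with a > 0
line_down = [10 - 3 * x for x in xs]    # y = ax + b with a < 0
parabola = [(x - 3) ** 2 for x in xs]   # perfect but nonlinear relationship

r_up = correlation(xs, line_up)         # essentially +1
r_down = correlation(xs, line_down)     # essentially -1
r_parabola = correlation(xs, parabola)  # essentially 0
print(r_up, r_down, r_parabola)
```

The parabola case shows why r = 0 only rules out a linear relationship: y is completely determined by x here, yet the correlation vanishes.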

[Figure: Scatterplots of X and Y from r = −1 to r = 1 in steps of 0.1.]

A regression line describes how a response variable y changes as an explanatory variable x changes. If the regression line is

$$y = a + bx,$$

the intercept a is the value of y when x = 0. The slope b measures the change in y for a one-unit increase in x. Once a and b are known, to predict a value of y given an x, simply plug the x-value into the equation and evaluate. However, prediction outside the range of the data used to estimate the line should be avoided. If the line was generated for x in the interval [10, 20], for example, predictions for x < 10 or x > 20 may not be valid.

Example: Suppose the annual sales (y) of a company are related to annual advertising expenditures (x) and the line is

$$y = 10000 + 5x,$$

where x is measured in dollars. What is the predicted annual sales if x = 10000?

$$y = 10000 + 5x = 10000 + 5(10000) = 60{,}000.$$

If the range of annual advertising used to generate this line was (5000, 20000), the predicted value of sales without advertising,

$$y = 10000 + 5(0) = 10000,$$

may not be valid.

The slope coefficient b of a regression of y on x is related to the correlation coefficient r via the formula

$$b = r\,\frac{s_y}{s_x},$$

where $s_y$ is the standard deviation of y and $s_x$ is the standard deviation of x. The signs of r and b will be the same (either both positive or both negative).

The coefficient of determination, $r^2$, indicates the proportion of the variability of Y that can be 'explained' by the variability of X. For example, if we perform a linear regression of Y on the explanatory variable X and find that $r^2 = 0.60$, then 60% of the variation in Y (about its mean) can be 'explained' by variation in X (about its mean).

Quick facts about $r^2$:

• $r^2$ is the square of the correlation coefficient of X and Y.
• $0 \le r^2 \le 1$.
• If $r^2 = 0$, there is no linear relationship between X and Y.
• If $r^2 = 1$, there is a perfect linear relationship between X and Y. Without additional information, we cannot tell if this relationship is positive or negative since $(-1)^2 = 1 = (+1)^2$.
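The relationship b = r · s_y / s_x, the coefficient of determination, and the warning about predicting outside the data range can be sketched in a few lines. This is an illustrative sketch, not a prescribed method: the data set is hypothetical, the intercept uses the standard least-squares fact a = ȳ − b·x̄ (not stated in these notes), and raising an error on extrapolation is just one possible design choice.

```python
import statistics

def fit_line(xs, ys):
    """Return (a, b, r) for the line y = a + b*x, with b = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # n - 1 divisor
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar  # assumption: standard least-squares intercept
    return a, b, r

def predict(a, b, x, x_min, x_max):
    """Predict y = a + b*x, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"x = {x} is outside the fitted range [{x_min}, {x_max}]")
    return a + b * x

# Hypothetical data, chosen so that r^2 comes out to 0.60.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = fit_line(xs, ys)
r_squared = r ** 2
print(a, b, r, r_squared)

# The advertising example from the text: y = 10000 + 5x, fitted on (5000, 20000).
sales = predict(10000, 5, 10000, 5000, 20000)
print(sales)
```

Calling predict(10000, 5, 0, 5000, 20000) raises ValueError, mirroring the caution that the predicted sales without advertising may not be valid.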

2 Price Indices

The consumer price index (CPI) is a measure of inflation. In general, the value of a dollar declines over time. In other words, to buy a fixed basket of goods (say the typical goods and services an urban household purchases) it takes more dollars today than it did in the past. The current (nominal) price of a good is the cost in today's dollars. The real (inflation-adjusted) price of a good takes into account that the general level of prices has changed. To convert a dollar amount at time A into the equivalent real value in dollars at time B, using the CPI as our indicator of the general level of prices, use the formula

$$\text{dollars at time } B = \frac{\text{dollars at time } A}{\text{CPI at time } A} \times \text{CPI at time } B.$$

Example: Suppose the CPI for 2000 to 2006 and the prices of a given statistics text are as follows:

Year   2000   2001   2002   2003   2004   2005   2006
CPI     100    103    107    112    115    119    124
Text     74     82     88     96    105    118    134

1. In terms of 2002 dollars, what is the textbook price in 2006? Let B be year 2002 and A be 2006; then the 2006 price in 2002 dollars is

$$\text{dollars at time } B = \frac{\text{dollars at time } A}{\text{CPI at time } A} \times \text{CPI at time } B = \frac{134}{124} \times 107 = 115.63.$$

2. In terms of 2006 dollars, what is the textbook price in 2002? Let A be year 2002 and B be 2006; then the 2002 price in 2006 dollars is

$$\text{dollars at time } B = \frac{\text{dollars at time } A}{\text{CPI at time } A} \times \text{CPI at time } B = \frac{88}{107} \times 124 = 101.98.$$

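3. How much did the price of the textbook increase from 2000 to 2006?

(a) In nominal terms (i.e., stated price):

$$\frac{134 - 74}{74} = 0.810811 \approx 81\%.$$

(b) In real terms (i.e., adjusted for the general level of inflation): Let A be year 2000 and B be 2006; then the real increase in the textbook price was

$$\frac{\dfrac{\text{dollars at time } B}{\text{CPI at time } B} - \dfrac{\text{dollars at time } A}{\text{CPI at time } A}}{\dfrac{\text{dollars at time } A}{\text{CPI at time } A}} = \frac{\dfrac{134}{124} - \dfrac{74}{100}}{\dfrac{74}{100}} = 0.460331 \approx 46\%,$$

or equivalently

$$\frac{\dfrac{\text{dollars at time } B}{\text{CPI at time } B}}{\dfrac{\text{dollars at time } A}{\text{CPI at time } A}} - 1 = \frac{\dfrac{134}{124}}{\dfrac{74}{100}} - 1 = 0.460331 \approx 46\%.$$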
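The three textbook-price calculations can be reproduced with a short script. This is a sketch using the CPI and price table from the example; the helper name convert and the dictionary names are illustrative choices, not part of the original notes.

```python
# CPI and textbook prices from the example table.
cpi = {2000: 100, 2001: 103, 2002: 107, 2003: 112, 2004: 115, 2005: 119, 2006: 124}
text_price = {2000: 74, 2001: 82, 2002: 88, 2003: 96, 2004: 105, 2005: 118, 2006: 134}

def convert(dollars, from_year, to_year):
    """dollars at time B = (dollars at time A / CPI at time A) * CPI at time B."""
    return dollars / cpi[from_year] * cpi[to_year]

# 1. The 2006 price expressed in 2002 dollars.
price_2006_in_2002 = convert(text_price[2006], 2006, 2002)

# 2. The 2002 price expressed in 2006 dollars.
price_2002_in_2006 = convert(text_price[2002], 2002, 2006)

# 3(a). Nominal increase from 2000 to 2006.
nominal_increase = (text_price[2006] - text_price[2000]) / text_price[2000]

# 3(b). Real increase: express the 2006 price in 2000 dollars, then compare.
real_increase = convert(text_price[2006], 2006, 2000) / text_price[2000] - 1

print(round(price_2006_in_2002, 2))  # 115.63
print(round(price_2002_in_2006, 2))  # 101.98
print(round(nominal_increase, 4))    # 0.8108
print(round(real_increase, 4))       # 0.4603
```

Because the 2000 CPI is 100, converting the 2006 price into 2000 dollars and comparing it with the 2000 price gives the same ratio as deflating both prices by their own CPI.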

3 Probability

• We denote by S the collection of all possible outcomes.
• An event is a collection of outcomes.
• If A is an event, the probability that A occurs is denoted by P(A).
• 0 ≤ P(A) ≤ 1: probabilities of events are between 0 and 1, inclusive.
• P(∅) = 0, where ∅ is the impossible event (an event that cannot happen).
• P(S) = 1, where S is the sure event (the event that one of the possible outcomes occurs: toss a coin and the probability that it comes up either heads or tails is one, since one of the possible outcomes must occur).
• $P(A^C) = 1 - P(A)$, where $A^C$ is the event that A does not occur.
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B), where A ∪ B is the event that A occurs or B occurs or both occur, and A ∩ B is the event that both A and B occur.
• If A and B have no outcomes in common, P(A ∪ B) = P(A) + P(B).
• If there are a finite number of outcomes, to find the probability of an event we simply add the probabilities of the individual outcomes that make up the event. (If we roll a 6-sided die, the probability of the event A that we roll an even number is the probability that we roll a 2 plus the probability that we roll a 4 plus the probability that we roll a 6.)
• If knowing that event B occurs doesn't change the probability that A occurs, we say A and B are independent events (say we toss two coins: the event A 'heads on the first coin' and the event B 'heads on the second coin' would be independent events).
• If events A and B are independent, P(A ∩ B) = P(A) P(B). (If we toss two fair coins, the probability that they are both heads is 1/4: the probability that the first is a head (1/2) times the probability that the second is a head (1/2).)
• The conditional probability that event A occurs given that event B occurs is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

• The expected value of a random variable X is given by

$$E(X) = \sum_{x} x\,P(X = x).$$

That is, it is a weighted average of the possible outcomes where the weights are the associated probabilities. (If X is the number of heads on one toss of a fair coin, E(X) = 0(1/2) + 1(1/2) = 1/2.)

Example: Suppose we have a loaded (not balanced) 6-sided die that has the following probability distribution:

x           1     2     3     4     5     6
P(X = x)    0.1   0.2   0.3   0.2   0.1   0.1

1. What is the set of all possible outcomes?

S = {1, 2, 3, 4, 5, 6}

2. Is this probability distribution valid? (There are two parts to this question.)

(a) Are the probability assignments to each of the possible values nonnegative (≥ 0)? Yes.

(b) Do the probability assignments add up to one? Yes,

0.1 + 0.2 + 0.3 + 0.2 + 0.1 + 0.1 = 1.0.

So this is a valid probability distribution.

3. What is P(X = 4)? From the table of probabilities, P(X = 4) = 0.2.

4. What is the probability that X = 8? That is, what is P(X = 8)? Zero. You cannot roll an 8; it is an impossible event. All impossible events have probability zero.

5. If A is the event an odd number is rolled, what is P(A)?

A = {1, 3, 5}
P(A) = P({1, 3, 5}) = P(X = 1) + P(X = 3) + P(X = 5) = 0.1 + 0.3 + 0.1 = 0.5.

6. If B is the event that a 5 or a 6 is rolled, what is P(B)?

B = {5, 6}
P(B) = P(X = 5) + P(X = 6) = 0.1 + 0.1 = 0.2.

7. What is the probability that both A and B occur? That is, what is P(A ∩ B)?

A ∩ B = {5}
P(A ∩ B) = P(X = 5) = 0.1.

8. What is the probability that either A or B occurs? That is, what is P(A ∪ B)? Since the events have a common outcome, namely X = 5,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.5 + 0.2 − 0.1 = 0.6.

9. What is the conditional probability that B occurs given that A occurred (if you roll a 1, 3, or 5, what is the probability that the number rolled is a 5 or a 6)?

$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{0.1}{0.5} = 0.2.$$

10. Are A and B independent events (does knowing that A occurred alter the probability that B occurred)?

P(A ∩ B) = 0.1 = (0.5)(0.2) = P(A) P(B).

Thus the events are independent.

11. What is the expected value of X?

$$\begin{aligned}
E(X) &= \sum_{x} x\,P(X = x) \\
&= 1\,P(X=1) + 2\,P(X=2) + 3\,P(X=3) + 4\,P(X=4) + 5\,P(X=5) + 6\,P(X=6) \\
&= 1(0.1) + 2(0.2) + 3(0.3) + 4(0.2) + 5(0.1) + 6(0.1) \\
&= 0.1 + 0.4 + 0.9 + 0.8 + 0.5 + 0.6 \\
&= 3.3.
\end{aligned}$$

The expected value need not be a possible value but it must be a number between the smallest possible outcome and the largest possible outcome.
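The loaded-die example can be checked mechanically. The sketch below stores the probability table as exact fractions (an implementation choice made here so that the independence check P(A ∩ B) = P(A) P(B) is not disturbed by floating-point rounding); the helper name prob is hypothetical.

```python
from fractions import Fraction

# Probability distribution of the loaded die, as exact fractions.
pmf = {
    1: Fraction(1, 10), 2: Fraction(2, 10), 3: Fraction(3, 10),
    4: Fraction(2, 10), 5: Fraction(1, 10), 6: Fraction(1, 10),
}

def prob(event):
    """Add the probabilities of the outcomes in the event (0 for impossible outcomes)."""
    return sum(pmf.get(outcome, 0) for outcome in event)

# Validity: nonnegative probabilities that add up to one.
is_valid = all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

A = {1, 3, 5}   # an odd number is rolled
B = {5, 6}      # a 5 or a 6 is rolled

p_a = prob(A)
p_b = prob(B)
p_a_and_b = prob(A & B)              # intersection
p_a_or_b = p_a + p_b - p_a_and_b     # addition rule
p_b_given_a = p_a_and_b / p_a        # conditional probability
independent = (p_a_and_b == p_a * p_b)

expected_value = sum(x * p for x, p in pmf.items())

print(is_valid, float(p_a), float(p_b), float(p_a_and_b))
print(float(p_a_or_b), float(p_b_given_a), independent, float(expected_value))
```

Set operations map directly onto the event notation: A & B is A ∩ B and A | B is A ∪ B, so prob(A | B) gives the same value as the addition rule.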

4 Normal Distribution

The normal curve is symmetric about the mean. Hence the mean and median are the same. Two parameters define any normal variable: the mean µ and the variance σ² (or the standard deviation σ). The proportion of values to the right of the mean (median) is one-half, as is the proportion to the left of the mean. Typically we are interested in finding proportions for one of three types of regions. If X is normal with mean µ and standard deviation σ, and a and b are any real numbers, we can ask for

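$$P(X < a), \qquad P(X > b), \qquad P(a < X < b).$$

We only have (only need) tables for the standard normal random variable. A standard normal variable has mean zero and variance one. For X normal with mean µ and standard deviation σ,

$$Z = \frac{X - \mu}{\sigma}$$

is a standard normal variable. Then,

$$P(X < a) = P\left(\frac{X-\mu}{\sigma} < \frac{a-\mu}{\sigma}\right) = P\left(Z < \frac{a-\mu}{\sigma}\right)$$

$$P(X > b) = P\left(\frac{X-\mu}{\sigma} > \frac{b-\mu}{\sigma}\right) = P\left(Z > \frac{b-\mu}{\sigma}\right) = 1 - P\left(Z < \frac{b-\mu}{\sigma}\right)$$

$$P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right) = P\left(Z < \frac{b-\mu}{\sigma}\right) - P\left(Z < \frac{a-\mu}{\sigma}\right)$$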

As an example, suppose heights are normal with mean 65 inches and standard deviation 3 inches.

(i) What proportion are less than 62 inches?

$$P(X < 62) = P\left(\frac{X-\mu}{\sigma} < \frac{62-\mu}{\sigma}\right) = P\left(Z < \frac{62-65}{3}\right) = P(Z < -1) = 0.1587,$$

where the proportion to the left of Z = −1 can be found in a table or using the graph below.

(ii) What proportion are taller than 59 inches?

$$P(X > 59) = P\left(\frac{X-\mu}{\sigma} > \frac{59-\mu}{\sigma}\right) = P\left(Z > \frac{59-65}{3}\right) = P(Z > -2) = 1 - 0.0228 = 0.9772.$$

If X is normal with mean µ and standard deviation σ:
(i) about 68.26% of observations are within one standard deviation of the mean,
(ii) about 95.44% of observations are within two standard deviations of the mean,
(iii) about 99.74% of observations are within three standard deviations of the mean.
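Where a table is not at hand, the same proportions can be computed with the NormalDist class in Python's standard statistics module (available from Python 3.8). This is a sketch of the heights example and the within-k-standard-deviations rule; the variable names are illustrative. Exact computation can differ from the table-based figures above in the last digit, because the tables round intermediate values.

```python
from statistics import NormalDist

# Heights: normal with mean 65 inches and standard deviation 3 inches.
heights = NormalDist(mu=65, sigma=3)

# (i) P(X < 62), the proportion shorter than 62 inches.
p_less_62 = heights.cdf(62)

# (ii) P(X > 59) = 1 - P(X < 59), the proportion taller than 59 inches.
p_more_59 = 1 - heights.cdf(59)

def prob_between(dist, a, b):
    """P(a < X < b) = P(X < b) - P(X < a)."""
    return dist.cdf(b) - dist.cdf(a)

# Proportion within k standard deviations of the mean, for k = 1, 2, 3.
standard = NormalDist()  # mean 0, standard deviation 1
within = {k: prob_between(standard, -k, k) for k in (1, 2, 3)}

print(round(p_less_62, 4))  # 0.1587
print(round(p_more_59, 4))  # about 0.977
print({k: round(v, 4) for k, v in within.items()})
```

Standardizing by hand is not needed here because cdf already accounts for the mean and standard deviation; heights.cdf(62) equals standard.cdf((62 - 65) / 3).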

The tables can also be used in reverse: given a proportion less than (greater than, between) an unknown value x, find x if X is normal with mean µ and variance σ². Since

$$Z = \frac{X - \mu}{\sigma} \;\Rightarrow\; X = \mu + Z\sigma,$$

given a proportion, find the corresponding z and plug it into the formula

$$x = \mu + z\sigma.$$

Example: If heights are normal with mean 65 inches and standard deviation 3 inches, 95% are less than what height? From the normal table, 95% are smaller than z when z = 1.64. Hence

$$x = 65 + (1.64)(3) = 69.92.$$
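The reverse lookup can also be done in code. The sketch below uses the inv_cdf method of statistics.NormalDist (Python 3.8+) to find z and then applies x = µ + zσ; the variable names are illustrative. The computed z is slightly larger than the table value 1.64 used above, so the resulting height differs in the second decimal place.

```python
from statistics import NormalDist

mu, sigma = 65, 3   # heights example: mean 65 inches, standard deviation 3 inches
proportion = 0.95

# Step 1: find z such that P(Z < z) = 0.95 for a standard normal Z.
z = NormalDist().inv_cdf(proportion)

# Step 2: convert back to the original scale with x = mu + z * sigma.
x = mu + z * sigma

# Same calculation with the rounded table value z = 1.64 used in the text.
x_table = mu + 1.64 * sigma

# Shortcut: ask the N(65, 3) distribution for its 95th percentile directly.
x_direct = NormalDist(mu=mu, sigma=sigma).inv_cdf(proportion)

print(round(z, 3), round(x, 2), round(x_table, 2), round(x_direct, 2))
```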

5 Confidence Intervals, Distributions, and Margin of Error

The margin of error for an approximate 95% confidence interval (CI) for a population proportion p is given by

$$\text{MoE} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

where n is the sample size and $\hat{p}$ is the sample proportion that estimates the population parameter p. A 95% confidence interval for p is then

$$\left(\hat{p} - 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\;\; \hat{p} + 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right).$$

If n is reasonably large (say n > 30), then $\hat{p}$ is approximately Normal with mean p and standard deviation $\sqrt{\hat{p}(1-\hat{p})/n}$. Recall that if Z is standard normal, then 95.44% of the values fall within 2 standard deviations of the mean (hence the multiplicative factor of 2; sometimes you'll see 1.96 instead, as 95% of the values fall within 1.96 standard deviations of the mean for a standard normal variable).

Example: In a survey of 200 people, 40% were for Proposition Q. What is an approximate 95% CI for the true proportion supporting Proposition Q?

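$$\hat{p} = 0.40, \qquad n = 200$$

$$\hat{\sigma} = \sqrt{\frac{0.4(0.6)}{200}} = 0.034641$$

$$\text{MoE} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{0.4(0.6)}{200}} = 0.0693.$$

Hence, an approximate 95% CI for the true proportion supporting Proposition Q is given by

$$\left(\hat{p} - 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\;\; \hat{p} + 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) = (0.40 - 0.0693,\; 0.40 + 0.0693) = (0.3307,\; 0.4693).$$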
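The Proposition Q calculation can be reproduced as follows. This is a minimal sketch of the margin-of-error formula above; the function names and the multiplier parameter are illustrative, with the multiplier defaulting to 2 as in the text and 1.96 shown as the alternative the text mentions.

```python
import math

def margin_of_error(p_hat, n, multiplier=2.0):
    """MoE = multiplier * sqrt(p_hat * (1 - p_hat) / n)."""
    return multiplier * math.sqrt(p_hat * (1 - p_hat) / n)

def confidence_interval(p_hat, n, multiplier=2.0):
    """Approximate CI: (p_hat - MoE, p_hat + MoE)."""
    moe = margin_of_error(p_hat, n, multiplier)
    return p_hat - moe, p_hat + moe

# Survey example: 200 people, 40% for Proposition Q.
p_hat, n = 0.40, 200
sigma_hat = math.sqrt(p_hat * (1 - p_hat) / n)
moe = margin_of_error(p_hat, n)
low, high = confidence_interval(p_hat, n)

print(round(sigma_hat, 6))             # 0.034641
print(round(moe, 4))                   # 0.0693
print(round(low, 4), round(high, 4))   # 0.3307 0.4693

# Using 1.96 instead of 2 gives a slightly narrower interval.
low_196, high_196 = confidence_interval(p_hat, n, multiplier=1.96)
print(round(low_196, 4), round(high_196, 4))
```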
