
Bivariate Analysis

IB Content 5.4: Linear correlation of bivariate data. Pearson’s product-moment correlation coefficient r. Scatter diagrams; lines of best fit. Equation of the regression line of y on x. Use of the equation for prediction purposes. Mathematical and contextual interpretation.

Bivariate data is data based on two variables. Typically the independent variable is x and the dependent variable is y. Time is a common independent variable.

Bivariate data can be displayed as a scatter diagram, and from the diagram we can see whether there is a relationship, or correlation, between the two variables.

The correlation between two variables can be positive, negative or zero (no correlation), and can be rated as strong, moderate or weak. The relationship can also be linear or non-linear.

Below are examples of various correlations from Mathematics Standard Level, Oxford, page 335.

Panels: No correlation; Strong negative correlation; Strong positive non-linear correlation.

Example: For each data set, determine the correlation and whether the relationship is linear or non-linear. (Taken from http://www.regentsprep.org/regents/math/algebra/ad4/scatter.htm)

a) b) c)

Solutions: a) weak positive linear, b) zero, c) strong negative correlation

It is important to note that just because two variables are correlated, it does not necessarily mean that one variable causes the other. When a change in one variable does cause a change in the other, this is called causation.

Line of best fit (regression line): A line drawn on a scatter diagram to show the direction and trend of the relationship between the two variables. It can also be used to make predictions about the variables. A line drawn by eye can vary in where it is placed, but it is always drawn through the mean point $(\bar{x}, \bar{y})$.

Interpolation: estimating between known data values; finding a value of one variable from its relationship with the other variable, inside the range of the data.

Extrapolation: estimating beyond known data values; finding a value of one variable from its relationship with the other variable, outside the range of the data.
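One way to make the distinction concrete is to check whether the x-value being predicted lies inside the range of the observed data. The sketch below is only illustrative: the function name `classify_prediction` is an assumption introduced here, and the years are those covered by the gasoline price example later in these notes.

```python
def classify_prediction(x_values, x_new):
    """Label a prediction at x_new as interpolation or extrapolation."""
    lowest, highest = min(x_values), max(x_values)
    # Inside the observed range: interpolation; outside it: extrapolation.
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# Years covered by the gasoline price table (1976 to 2004 inclusive).
years = list(range(1976, 2005))

print(classify_prediction(years, 1990))  # a year inside the data
print(classify_prediction(years, 2013))  # a year beyond the data
```

Extrapolated values are generally less reliable than interpolated ones, because the trend seen in the data may not continue outside the observed range.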

Example: Mathematics Standard Level, Oxford, page 342

Example: Mathematics Standard Level, Oxford, page 344

State what the slope and y-intercept are and interpret them if relevant. If not, explain why.

A police chief wants to investigate the relationship between the number of times y a person has been convicted of a crime and the number of criminals x that person knows. The equation was $y = 0.5x + 6$.

Solution:

Least Squares Regression

When we draw a regression line, we often see that the plotted points do not lie exactly on the line. Residuals are the vertical distances between the plotted points and the regression line, as seen below:

Mathematics Standard Level, Oxford, page 345
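As a rough illustration of residuals, the sketch below computes them for the first few points of the gasoline price example in these notes, using the rounded line of best fit from that example's worked answer (gradient 0.02226, intercept -43.12). The function name `residuals` is an assumption introduced here. Each residual is signed (observed y minus predicted y); the vertical distance is its absolute value.

```python
def residuals(points, gradient, intercept):
    """Return observed y minus predicted y for each (x, y) point."""
    return [y - (gradient * x + intercept) for x, y in points]


# First few (year, July price) pairs from the gasoline table in these notes.
points = [(1976, 0.62), (1977, 0.67), (1978, 0.67), (1979, 0.95), (1980, 1.27)]

# Rounded line of best fit from the worked example: y = 0.02226x - 43.12.
gradient, intercept = 0.02226, -43.12

res = residuals(points, gradient, intercept)

# Least squares regression chooses the line that makes this sum as small
# as possible over the whole data set.
sum_of_squares = sum(e ** 2 for e in res)

for (x, y), e in zip(points, res):
    print(x, y, round(e, 3))
print("sum of squared residuals:", round(sum_of_squares, 4))
```

A negative residual means the point lies below the line, and a positive residual means it lies above.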

The formula for finding the gradient m of a regression line is as follows:

$$m = \frac{S_{xy}}{(S_x)^2}, \quad \text{where} \quad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}, \quad (S_x)^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$

Note that we will use our GDC to calculate the regression line.

Example: Given the following data (Bureau of Labor):
a) Sketch a scatter plot with a line of best fit using your GDC
b) State the equation of the line of best fit
c) Use your equation to estimate the July price in 2013

Price of Gasoline

Year    July Price
1976    0.62
1977    0.67
1978    0.67
1979    0.95
1980    1.27
1981    1.38
1982    1.33
1983    1.29
1984    1.21
1985    1.24
1986    0.89
1987    0.97
1988    0.97
1989    1.09
1990    1.08
1991    1.13
1992    1.17
1993    1.11
1994    1.14
1995    1.20
1996    1.27
1997    1.21
1998    1.08
1999    1.19
2000    1.59
2001    1.48
2002    1.41
2003    1.52
2004    1.94

c) $y = 0.02226(2013) - 43.12 \approx 1.69$
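To cross-check the GDC result, the gradient can be computed directly from the sums of x, y, xy and x squared. The sketch below is a minimal version of that calculation using the gasoline data above. The function name `regression_line` is an assumption introduced here, and the intercept is found from the fact that the least squares line passes through the mean point.

```python
def regression_line(xs, ys):
    """Return (gradient, intercept) of the least squares line of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # S_xy = sum(xy) - (sum x)(sum y)/n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # (S_x)^2 = sum(x^2) - (sum x)^2/n
    s_x_squared = sum(x * x for x in xs) - sum_x ** 2 / n
    gradient = s_xy / s_x_squared
    # The line passes through the mean point, which fixes the intercept.
    intercept = sum_y / n - gradient * sum_x / n
    return gradient, intercept


years = list(range(1976, 2005))
prices = [0.62, 0.67, 0.67, 0.95, 1.27, 1.38, 1.33, 1.29, 1.21, 1.24,
          0.89, 0.97, 0.97, 1.09, 1.08, 1.13, 1.17, 1.11, 1.14, 1.20,
          1.27, 1.21, 1.08, 1.19, 1.59, 1.48, 1.41, 1.52, 1.94]

gradient, intercept = regression_line(years, prices)
estimate_2013 = gradient * 2013 + intercept

print("gradient:", round(gradient, 5))
print("intercept:", round(intercept, 2))
print("estimated July price in 2013:", round(estimate_2013, 2))
```

Since 2013 lies outside the years in the table, this estimate is an extrapolation and should be treated with caution.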

The Pearson product-moment correlation coefficient (r) is a measure of the correlation between two variables. Its value is always between -1 and +1 inclusive, and it is used to determine the strength of the linear dependence between the two variables.

In our previous example the correlation coefficient is 0.677. Try getting this answer on your calculator.

Here are two examples from Mathematics Standard Level, Oxford, page 349

The formula for r is:

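$$r = \frac{S_{xy}}{S_x S_y}, \quad \text{where} \quad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}, \quad S_x = \sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}, \quad S_y = \sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}$$

Note that this is similar to the previous formula and, again, you are not required to calculate it by hand on the exam.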
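For anyone who wants to check a calculator value of r, the sketch below evaluates r from the sums of x, y, xy, x squared and y squared, applied to the gasoline price data from the earlier example. The function name `pearson_r` is an assumption introduced here; on the exam the GDC does this work.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's product-moment correlation coefficient r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_x = sqrt(sum(x * x for x in xs) - sum_x ** 2 / n)
    s_y = sqrt(sum(y * y for y in ys) - sum_y ** 2 / n)
    return s_xy / (s_x * s_y)


years = list(range(1976, 2005))
prices = [0.62, 0.67, 0.67, 0.95, 1.27, 1.38, 1.33, 1.29, 1.21, 1.24,
          0.89, 0.97, 0.97, 1.09, 1.08, 1.13, 1.17, 1.11, 1.14, 1.20,
          1.27, 1.21, 1.08, 1.19, 1.59, 1.48, 1.41, 1.52, 1.94]

r = pearson_r(years, prices)
print("r =", round(r, 3))
```

The printed value can be compared with the correlation coefficient quoted for the gasoline example.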

Here is a guide to what r values mean: Mathematics Standard Level, Oxford, page 350

Example: Given the number of sales in thousands for each given week:

Week    2   4   6   8   10  12  14  16  18  20
Sales   1   4   5   8   13  14  14  16  18  21

a) Sketch a scatter plot with a line of best fit using your GDC
b) What kind of correlation is this? State the value of r
c) State the equation of the line of best fit
d) Use your equation to estimate the number of sales in week 11
e) Use your equation to predict the number of sales in week 28

Solution:
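The GDC answers to parts b) to e) can be cross-checked with a short calculation. The sketch below is one possible way to do so and is not the method expected on the exam; the function name `fit_line_and_r` is an assumption introduced here. It returns the gradient, intercept and r for the sales data, then evaluates the line at week 11 (interpolation) and week 28 (extrapolation).

```python
from math import sqrt


def fit_line_and_r(xs, ys):
    """Return (gradient, intercept, r) for the regression line of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    gradient = s_xy / s_xx
    intercept = sum_y / n - gradient * sum_x / n
    r = s_xy / sqrt(s_xx * s_yy)
    return gradient, intercept, r


weeks = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
sales = [1, 4, 5, 8, 13, 14, 14, 16, 18, 21]  # in thousands

gradient, intercept, r = fit_line_and_r(weeks, sales)
week_11 = gradient * 11 + intercept  # inside the data: interpolation
week_28 = gradient * 28 + intercept  # beyond the data: extrapolation

print("r =", round(r, 3))
print("y =", round(gradient, 4), "x +", round(intercept, 4))
print("week 11:", round(week_11, 1), "thousand")
print("week 28:", round(week_28, 1), "thousand")
```

The week 28 value lies well beyond the last recorded week, so it is the less reliable of the two predictions.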