Further Maths Bivariate Data Summary
Representing Bivariate Data

Back-to-Back Stem Plots
A back-to-back stem plot is used to display bivariate data involving a numerical variable and a categorical variable with two categories.

Example: The data can be displayed as a back-to-back stem plot. From the distribution we see that the Labor distribution is symmetric, so its mean and median are very close, whereas the Liberal distribution is negatively skewed. Since the Liberal distribution is skewed, the median is a better indicator of the centre of that distribution than the mean. It can be seen that the Liberal party volunteers handed out many more "how to vote" cards than the Labor party volunteers.

Parallel Boxplots
When we want to display a relationship between a numerical variable and a categorical variable with two or more categories, parallel boxplots can be used.

Example: The 5-number summary of each class is determined and the four boxplots are drawn on the one scale. Based on the medians, 7B did best (median 77.5), followed by 7C (median 69.5), then 7D (median 65) and finally 7A (median 61.5).

Two-Way Frequency Tables and Segmented Bar Charts
When we are examining the relationship between two categorical variables, a two-way frequency table can be used.

Example: 67 primary and 47 secondary school students were asked about their attitude to the number of school holidays that should be given: fewer, the same number, or more. The results of the survey can be recorded in a two-way frequency table and a percentage two-way table, and the data in the percentage table can also be represented in a segmented bar chart. Clearly, secondary students were much keener on having more holidays than primary students were.

Dependent and Independent Variables
In a relationship involving two variables, if the values of one variable depend on the values of the other, the first is referred to as the dependent variable and the second as the independent variable. When a relationship between two variables is being examined, it is important to know which of the two depends on the other. Most often we can make a judgement about this, although sometimes it is not possible. For example, where the ages of company employees are compared with their annual salaries, you would reasonably expect an employee's annual salary to depend on their age: the age is the independent variable and the salary is the dependent variable. In a scatterplot we always place the independent variable on the x-axis and the dependent variable on the y-axis.

Scatterplots
A scatterplot displays the relationship between two numerical variables, with the independent variable on the x-axis. Example: the scatterplot shows a moderate, negative, linear relationship between the two variables.

Pearson's Correlation Coefficient (r)
Pearson's correlation coefficient r measures the strength and direction of a linear relationship between two numerical variables. It takes values between -1 (a perfect negative linear relationship) and +1 (a perfect positive linear relationship), with values near 0 indicating no linear relationship. Note that outliers can affect the accuracy of r.

The Coefficient of Determination (r²)
The coefficient of determination, r², is useful when two variables have a linear relationship. It tells us the percentage of the variation in one variable that can be explained by the variation in the other variable.
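Since r and r² are calculated rather than read from a graph, it can help to see the definition in action. Below is a minimal Python sketch of Pearson's r computed directly from its definition; the data values are made up purely for illustration.

from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired numerical data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 9.9]
r = pearson_r(xs, ys)
print(round(r, 4))       # close to +1: strong positive linear relationship
print(round(r ** 2, 4))  # coefficient of determination

Here r² of about 0.998 would mean about 99.8% of the variation in y can be explained by the variation in x.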
Linear Regression
If two variables have a moderate or strong association (positive or negative), we can find the equation of a line of best fit for the data and use it to make predictions. The general process of fitting curves to data is called regression, and the fitted line is called a regression line.

Lines of Best Fit by Eye
The simplest method is to plot the data on a scatterplot and place a line over the plot by eye so that it appears to best represent the pattern in the data values. This method will often give the approximate location of the regression line.

The 3-Median Method
Step 1. Plot the points on a scatterplot.
Step 2. Divide the points into three groups using vertical lines. If the number of points is not divisible by three, arrange the groups so that the two outer groups are of equal size.
Step 3. Find the median point of each group, (x_L, y_L), (x_M, y_M) and (x_U, y_U), taking the medians of the x-values and the y-values separately.
Step 4. Draw the line through the two outer median points, then slide it one third of the way towards the middle median point, keeping the gradient fixed. This gives the final 3-median line.
Step 5. Find the equation of the line (general form y = mx + c, where m is the gradient and c is the y-intercept). The gradient of the 3-median line is the gradient of the line passing through the upper and lower medians, (x_U, y_U) and (x_L, y_L):

    gradient = rise/run = (y_U − y_L) / (x_U − x_L)

The y-intercept c can be read from the graph if the scale on the axes begins at zero. Otherwise, take a point on the final line and determine c by substitution.

Using the Calculator to Determine the Equation of the 3-Median Line
The following table gives the winning high jump heights for consecutive Olympic Games from 1956 to 2000.
1. Enter the data into Lists and Spreadsheets view.
2. Press the Home button and enter Data and Statistics view.
3. Press the TAB key and select year as the independent variable (x value).
4. Press the TAB key and select height as the dependent variable (y value). You should see the scatterplot form.
5. Press Menu, Analyse, Regression and select Show Median-Median. You will see the plot with the 3-median regression line drawn through it.
6. The equation of the 3-median regression line is given as y = 0.006094x − 9.7751, which can be written as:

    Winning Height = 0.006094 × Year − 9.7751

7. Using this rule we can predict the winning jump in 2004 by substituting Year = 2004 into the equation:

    Winning Height = 0.006094 × 2004 − 9.7751 = 2.43728

The regression model predicts the winning jump to be 2.44 metres in 2004. It is interesting to note that the actual winning jump in 2004 was 2.36 m, which shows that extrapolating can be unreliable.

Interpolation occurs when a value substituted into the regression equation lies within the bounds of the given data; interpolation is reliable. Predicting the winning jump for the year 1978 in the above example is interpolation.

Extrapolation occurs when a value substituted into the regression equation lies outside the bounds of the given data; extrapolation is unreliable. Predicting the winning jump for the year 2004 in the above example is extrapolation.
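The Median-Median fit the calculator produces can also be reproduced in code. The sketch below is a minimal Python implementation of the standard 3-median rule (gradient from the outer medians, intercept averaged over all three median points, which is equivalent to the one-third slide described in Step 4); the data used are hypothetical placeholders, not the high jump table.

from statistics import median

def three_median_line(points):
    """Fit a 3-median (median-median) line y = m*x + c to (x, y) pairs."""
    pts = sorted(points)                     # order the points by x-value
    n = len(pts)
    k, r = divmod(n, 3)
    # Split into lower/middle/upper thirds, keeping the outer groups equal.
    sizes = {0: (k, k, k), 1: (k, k + 1, k), 2: (k + 1, k, k + 1)}[r]
    lower = pts[:sizes[0]]
    middle = pts[sizes[0]:sizes[0] + sizes[1]]
    upper = pts[sizes[0] + sizes[1]:]
    # Median point of each group, taking x- and y-medians separately.
    def med(group):
        return (median(x for x, _ in group), median(y for _, y in group))
    (xL, yL), (xM, yM), (xU, yU) = med(lower), med(middle), med(upper)
    m = (yU - yL) / (xU - xL)                # gradient from the outer medians
    c = ((yL + yM + yU) - m * (xL + xM + xU)) / 3  # intercept averaged over all three
    return m, c

# Hypothetical data for illustration only (not the high jump table).
data = [(1, 2.1), (2, 2.8), (3, 3.3), (4, 4.2), (5, 4.8), (6, 5.9), (7, 6.4)]
m, c = three_median_line(data)
print(f"y = {m:.4f}x + {c:.4f}")

Run on the actual high jump data, this rule should agree with the calculator's Median-Median output, y = 0.006094x − 9.7751.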
The Least Squares Regression Method
The least squares regression method finds the line that minimises the sum of the squares of the vertical distances between the data points and the line. We normally use CAS to generate the equation of the least squares regression line; the calculator gives it in the form y = mx + b.

Example: You would expect the number of skiers to depend on the depth of snow, so the independent variable is the depth of snow and the dependent variable is the number of skiers. Create a scatterplot of the data; this can be done on the calculator.
1. In Lists and Spreadsheet view, enter the data in the table.
2. Hit the Home button and go to Data and Statistics view.
3. Tab to the horizontal axis and select the independent variable, depth; tab to the vertical axis and select the dependent variable, skiers. The scatterplot will form.
4. It can be seen that there is a strong, positive, linear correlation between the depth of snow and the number of skiers: there is evidence to suggest that as the depth of snow increases, the number of skiers increases.
5. Next find r, the coefficient of correlation, and the coefficient of determination, r². Hit Ctrl and the left arrow to return to Lists and Spreadsheet view. Hit Menu, Statistics, Stat Calculations, Linear Regression (mx + b). Hit the Click button and select depth from the drop-down list for X List. Hit Tab and select skiers from the drop-down list for Y List. There is no need to enter data into the other boxes; tab to OK and hit Enter.

The coefficient of correlation is r = 0.88402. This indicates that there is a strong, positive correlation between the depth of snow and the number of skiers. The coefficient of determination is r² = 0.781492, so we can say that 78% of the variation in the number of skiers can be explained by the variation in the depth of snow.

The output also gives the line of best fit, the least squares regression equation:

    y = 186.418x + 28.3373

We can write this more clearly as:

    Number of skiers = 186.418 × depth of snow (in m) + 28.3373

6. The equation of the least squares regression line can also be determined in Data and Statistics view. Hit Ctrl and the right arrow to return to your scatterplot, then hit Menu, Analyse, Regression, Show Linear (mx + b). The same equation, y = 186.418x + 28.3373, is displayed.

Interpreting the Slope (Gradient) and y-intercept
The slope or gradient of 186.418 indicates that the number of skiers increases by about 186 for every 1 metre increase in the depth of snow. The y-intercept of 28.3373 indicates that if the depth of snow were 0, about 28 skiers would attend the resort.

Using the Least Squares Regression Equation to Make Predictions
Suppose we want to estimate the number of skiers when the depth of snow is 3 m. Substituting depth = 3 into the equation gives:

    Number of skiers = 186.418 × 3 + 28.3373 = 587.6 (to 1 decimal place)

so the model predicts about 588 skiers.
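The same analysis can be reproduced away from the calculator. The sketch below uses Python's standard library statistics module (the correlation and linear_regression functions require Python 3.10 or later); since the snow/skier table itself is not reproduced in this summary, the data values below are hypothetical placeholders.

from statistics import correlation, linear_regression  # requires Python 3.10+

# Hypothetical (depth of snow in m, number of skiers) data for illustration;
# the actual table from the example is not reproduced in this summary.
depth = [0.5, 0.8, 1.1, 1.6, 2.1, 2.5, 3.2]
skiers = [120, 250, 190, 330, 480, 420, 640]

r = correlation(depth, skiers)                         # Pearson's r
slope, intercept = linear_regression(depth, skiers)    # least squares y = mx + b

print(f"r = {r:.5f}, r^2 = {r * r:.5f}")
print(f"Number of skiers = {slope:.3f} x depth + {intercept:.3f}")

# Prediction for a snow depth of 3 m (interpolation here, since 3 m lies
# within the range of this data set).
print(f"Predicted skiers at 3 m: {slope * 3 + intercept:.0f}")

As with the calculator output, the slope and intercept would then be interpreted in context, and any prediction should be checked against the range of the data to decide whether it is interpolation or extrapolation.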