OpenStax-CNX module: m46177 1

Bivariate Descriptive Statistics: Introduction*

Irene Mary Duranczyk Suzanne Loch Janet Stottlemyer

This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0†

Abstract

This module includes categorical-categorical, categorical-measurement, and measurement-measurement descriptive statistics. It also contains information from the descriptive statistics chapter of the Collaborative Statistics collection (col10522) by Barbara Illowsky and Susan Dean.

1 Descriptive Statistics for Bivariate Data: Introduction

1.1 Student Learning Outcomes

By the end of this chapter, the student should be able to:

• Identify different types of relationships between variables: categorical-categorical, categorical-numerical, and numerical-numerical

Categorical - Categorical

• Know how to make and summarize a contingency table between two categorical variables
• Discuss and describe patterns in contingency tables and supporting graphic summaries

Categorical - Numerical

• Be able to build side-by-side histograms, back-to-back stemplots, and multiple box plots between a categorical and a numeric variable
• Compare and contrast multiple groups based on their shape, center, and spread

Numerical - Numerical

• Discuss basic ideas of the relationship between two numeric variables, including scatter plots, linear regression, and correlation
• Create and interpret a line of best fit
• Calculate and interpret the correlation coefficient

*Version 1.2: May 2, 2013 3:47 pm -0500
†http://creativecommons.org/licenses/by/3.0/

http://cnx.org/content/m46177/1.2/

1.2 Introduction

In the previous chapters you explored how to organize, summarize, and discuss univariate categorical and numerical variables. Often the more interesting questions involve relationships between sets of variables. For example, are men or women more likely to purchase coffee at a coffee shop, or do students who spend more time studying for an exam really do better? We can start to answer these questions by examining the relationship between the variables: categorical and categorical variables, categorical and numerical variables, or numerical and numerical variables. We will do this by producing graphs, calculating summary statistics, and making comparisons.

1.3 Categorical-Categorical Relationships: Contingency Tables

To examine a relationship between two categorical variables, we build a contingency table: a table that summarizes two categorical variables. These tables are also called two-way tables, crosstabulations, summary tables, or pivot tables (in Microsoft Excel). The table below is an example of a contingency table. It presents the number of students that fall into six different groups. The groups are based on answers to two categorical questions. The first question asked for gender (Female or Male) and the second asked for the type of transportation a student typically uses to go to school each day (bicycle, car, or walking).

Figure 1

In this situation, we get two measurements from each person: a person's gender and a person's type of transportation. One variable is represented by the rows (GENDER) and the other variable is represented by the columns (TRANSPORTATION). There are two values for gender (Female or Male) and three possible values for transportation (Bike, Car, or Walk). Each person can have only one value for gender and only one value for transportation. Together, the two variables are said to make a 2 x 3 table: two rows and three columns. This creates a table with six cells. The cells are the boxes that are outlined with a heavy line.

We are interested in whether there is a relationship between a person's gender and how they typically get to school. In other words, is there a difference between men and women with respect to the type of transportation they use to get to school? To answer this question we must look at percentages, since the numbers of men and women are unequal. Notice that the question about the difference in the relationship is stated in terms of whether or not the percentages are the same. To answer this question, you must first calculate the percentages for each box in the table so that you can make a comparison. However, there are three different types of percentages that you can calculate, and you need to choose the right type to make a meaningful comparison. You can calculate a percentage based on the Table Total for the entire table, on the Column Totals, or on the Row Totals.

Start off by calculating the row totals, column totals, and the overall total (or sample size). There are two rows, so there will be two row totals, one for the Females and one for the Males. For the females, there are 4 who ride a bike, 20 who use a car, and 45 who walk. The Row Total for females is, therefore, 4 + 20 + 45 = 69. Similarly, you should find that the row total for males is 61. Write the two row totals (69 and 61) into the boxes provided under ROW TOTAL.

Figure 2

There are three columns, so you will have three column totals. The column total for Bike is 17. Write that in for the Bike column total, and then calculate and fill in the column totals for Car and Walk. The Table Total is just the sum of the counts in all six cells. So, you can add up all of the cell counts, or just the two row totals, or just the three column totals. Any one of the three methods will give you exactly the same value for the Table Total. In fact, it is a good idea to calculate the Table Total in at least two ways to make sure the two calculations agree.

Now that you have all of the totals, it's time to decide which type to use to calculate the percentages for each cell. To do this, you need to consider the question that is asked: Is there a difference between men and women with respect to the type of transportation they use to get to school? The question focuses on the difference between men and women, so gender is our dominant variable. Since men and women are represented by the rows in the table, we want to use the ROW TOTALS to calculate ROW percentages.

Calculate the row percent for each cell. For example, there are 4 women who bike. The Row Total for women is 69. So the Row % for that cell (females who bike) is equal to 4/69 x 100 = 5.8%. That is what you write after ROW % = in the first cell. As another example, there are 23 men who walk. The Row Total for men is 61. Therefore, the Row % for that cell (men who walk) is 23/61 x 100 = 37.7%. Use this same method to calculate the other four row percentages and write them into the table.
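The totals and row percentages described above can be checked with a short Python sketch. It is only an illustration, not part of the original module. The male Bike and Car counts are not stated in the text; they are derived here from the stated totals (Bike column total 17, male row total 61, 23 men who walk), and the variable names are chosen for this example.

```python
# Counts from the worked example. The male Bike and Car counts are derived,
# not quoted: Bike = 17 - 4 = 13, Car = 61 - 13 - 23 = 25.
counts = {
    "Female": {"Bike": 4, "Car": 20, "Walk": 45},
    "Male": {"Bike": 13, "Car": 25, "Walk": 23},
}
modes = ["Bike", "Car", "Walk"]

# Row totals: one per gender.
row_totals = {g: sum(cells.values()) for g, cells in counts.items()}

# Column totals: one per mode of transportation.
column_totals = {m: sum(counts[g][m] for g in counts) for m in modes}

# Table total computed two ways, as a cross-check.
table_total = sum(row_totals.values())
assert table_total == sum(column_totals.values())

# Row percentages: each cell count divided by its row total, times 100.
row_percents = {
    g: {m: round(100 * counts[g][m] / row_totals[g], 1) for m in modes}
    for g in counts
}

for g in counts:
    print(g, "row total:", row_totals[g], "row %:", row_percents[g])
print("Column totals:", column_totals, "Table total:", table_total)
```

Dividing by `column_totals` instead of `row_totals` would give column percentages, which answer a different question (for example, what share of bikers are women).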


Figure 3

Now that you have all six percentages, you can compare them. You will certainly see that the percentages are not the same for men and women. Are the differences in percentages between males and females large enough for you to claim that there is a relationship between gender and type of transportation to school? In other words, is there a type of transportation that males tend to take more so than females, or vice versa?

When describing a relationship between two categorical variables represented in a two-way table, describe any similarities and differences, paying particular attention to the largest percentages based on your dominant variable. For this relationship there does seem to be a difference between men and women with respect to type of transportation used to get to school. Men are closely split between taking the car at 41% and walking at 37.7%, while the women's largest percentage is for walking at 65.2%, followed by driving a car at 29%.

Here is a summary of the steps needed to identify whether you should calculate row or column percentages:

• Read the question carefully.
• Calculate the Row total, Column total, and Overall total.
• Based on the question, determine whether Row percentages or Column percentages should be used to answer the question.
• Calculate the appropriate percentages (either Row or Column).
• Remember to label the percentages as either ROW or COLUMN.
• Finally, write a few sentences that compare the percentages in order to answer the question (Is there a relationship or are the variables independent?).

Graphical Displays of Contingency Tables

The relationship you are examining in a contingency table can also be viewed graphically with multiple bar graphs, multiple pie charts, or a segmented bar chart. The following graph of Male and Female Modes of Transportation represents all the data from the contingency table in a bar graph using row percents. The graph on the left uses clustered columns to display the data. The graph on the right uses stacked columns to display the data. Also take note that, depending on your variable of interest, you can organize your data on the x-axis by mode of transportation or by gender.


Figure 4

This data can also be represented by multiple pie charts. The graphs below demonstrate how the data can be represented by two pie charts, one for each gender. If we used mode of transportation to define the groups for the pie charts, we would need to display three pie charts with the percent male and female on each chart.

Figure 5

When making any of these graphs to show the relationship between two categorical variables, always include the percentages, not the frequencies. The percentages allow us to compare groups with different sample sizes. However, somewhere in your narrative or in the description of your data, the actual counts should be reported so that the reader knows the basis of the percentages. Including a contingency table with counts would fill that need.


1.4 Categorical-Numerical Relationships: Comparing Distributions with Two or More Groups

Another type of relationship between variables that we are interested in examining is between a categorical and a numerical variable. All the summary statistics and graphing techniques that we learned about in the previous chapter can be used to compare multiple groups on the same variable. Have you ever wondered who gets more speeding tickets, men or women, or who spends more time studying on your campus: freshmen, sophomores, juniors, or seniors? These questions can be examined by constructing multiple histograms, back-to-back stemplots, and multiple box plots. Below is an example of a box plot showing the differences between the number of hours per week male and female students studied for their statistics course.

Figure 6

When comparing groups with any of these displays, we want to pay special attention to differences and similarities in shape, center, and spread. You will want to compare the groups' shapes and comment on any gaps, multiple modes or peaks, and outliers you see in the data. This is easy to do when the data is graphed on the same scale, as in the graph above. Next, compare and contrast the measures of center: mean, median, and mode. These statistics describe a typical response to a numerical question. Are these measurements the same between groups or not? Lastly, examine the measures of variability: range, standard deviation, and interquartile range. Are these measurements of variability and spread similar or different? Remember, our purpose in examining these groups is to see if there is a difference. We are looking to see if there is a relationship between the multiple groups.
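As a rough illustration of comparing center and spread across groups, the sketch below summarizes two groups with Python's standard `statistics` module. The study-hours values are invented for this example and do not come from the module's figure; the helper name `summarize` is also hypothetical. Quartiles can be computed by several conventions, so the interquartile range here may differ slightly from a hand calculation.

```python
import statistics

# Hypothetical study hours per week, invented for illustration only.
hours = {
    "Female": [2, 4, 5, 5, 6, 7, 8, 10, 12],
    "Male": [1, 2, 3, 4, 4, 5, 6, 9, 15],
}

def summarize(values):
    """Return measures of center and spread for one group."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "stdev": round(statistics.stdev(values), 2),  # sample standard deviation
    }

for group, values in hours.items():
    print(group, summarize(values))
```

Printing the summaries side by side makes it easier to say whether the groups differ in center, in spread, or in both.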

1.5 Numerical - Numerical Relationships: Linear Regression and Correlation

Professionals often want to know how two or more numeric variables are related. For example, is there a relationship between the grade on the second math exam a student takes and the grade on the final exam? If there is a relationship, what is it and how strong is it? In another example, your income may be determined by your education, your profession, your years of experience, and your ability. The amount you pay a repair person for labor is often determined by an initial amount plus an hourly fee. These are all examples in which regression can be used.

The type of data described in the examples is bivariate data - "bi" for two variables. In reality, statisticians use multivariate data, meaning many variables. In the next section, you will be studying the simplest form of regression, "linear regression" with one independent variable (x). This involves data that fits a line in two dimensions. You will also study correlation, which measures how strong the relationship is.
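To give a first sense of what a line of best fit and a correlation coefficient look like in practice, here is a minimal Python sketch using the standard least-squares formulas. The data points and the function name `fit_line` are made up for this illustration; the module itself develops these ideas in the next section.

```python
import math

def fit_line(x, y):
    """Return (slope, intercept, r) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return slope, intercept, r

# Hypothetical paired measurements, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f}x, r = {r:.3f}")
```

A value of r near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one.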
