Excerpt of draft Categorical Association mini-Module. Copyright 2017 Casey and Ross

STATISTICAL KNOWLEDGE FOR TEACHING CATEGORICAL ASSOCIATION

INTRODUCTION

This draft material was meant to be supplemental to introductory textbooks, whether they are algebra-based or calculus-based. It therefore does not attempt to teach the mechanics of formal hypothesis testing (e.g., Chi-Squared computations). Please consult the many free online introductory statistics books for that material (see the References for links). Future versions of this material will include more introductory statistics content rather than being supplemental.

This material is designed to address Statistical Knowledge for Teaching regarding categorical association, which has two main components: Subject Matter Knowledge and Pedagogical Content Knowledge. Subject Matter Knowledge involves material that is typically contained in ordinary statistics courses, but algebra-based courses tend to skip some parts and calculus-based courses tend to skip other parts, so here we present a unified (but non-calculus) view appropriate for future teachers. Pedagogical Content Knowledge is what teachers specifically need to know. In this material, you will discuss common student misconceptions, respond to students’ work, construct lessons that use key pedagogical ideas, and consider standards from the Common Core State Standards for Mathematics (CCSS-M) and from the Advanced Placement Statistics curriculum (AP Statistics is a college-level introductory statistics course offered at high schools). See Activity 7 for the standards.
TABLE OF CONTENTS (for entire mini-module; this excerpt includes parts of Activities 3, 4, and 7)

Overview
Activity 1: Table Construction
Two Big Questions
Activity 2: Joint, Marginal, and Conditional Relative Frequencies
    Joint Relative Frequencies
    Marginal Relative Frequencies
    Conditional Relative Frequencies
Independence
Activity 3: Analyzing student thinking
Activity 4: Graphing and EDA
    Comparative Bar Charts
    Segmented Bar Graphs
    Other plots you might see: Mosaic/Ribbon Charts
    Other plots you might see: Pie Charts
Activity 5: Association versus Causation
Activity 6: Chi-square analysis thoughts
A Simpler Explanation
Activity 7: Curriculum Standards: CCSS versus AP Statistics
Activity 8:
References
    Free Online Statistics Books
    Bibliography
Homework


Activity 3: Analyzing student thinking

Do the following two tasks:

TASK 1 (Smoking): In a medical center 250 people have been observed in order to determine whether the habit of smoking has some relationship with bronchial disease. The following results have been obtained:

             Bronchial disease   No bronchial disease   Total
Smoke                       90                     60     150
Not smoke                   60                     40     100
Total                      150                    100     250

Question 3-a: For this sample of people, is bronchial disease associated with smoking? Explain your answer.

TASK 2 (Drug): We are interested in assessing if a certain drug produces digestive troubles in old people. For a sufficient period, 25 old people have been studied, and these results have been obtained:

              Digestive troubles   No digestive troubles   Total
Drug taken                     9                       8      17
No drug                        7                       1       8
Total                         16                       9      25

Question 3-b: Using the data in this table, for this sample of old people, is digestive trouble associated with taking the drug?

What follows are some sample student responses to these tasks that illustrate misconceptions students have when doing tasks like these. They are real student responses to these tasks, or to ones like them, drawn from research studies done by statistics educators.

Set 1

David (Smoking task): Yes, there is an association. Although both the percentage of smokers with bronchial disease and the percentage of non-smokers with bronchial disease are 60%, there are more smokers with bronchial disease than nonsmokers.

Nathan (Smoking task): Ummm wouldn’t it make more sense though that they would have to interview the same amount of people because if they did then they would have more yeses and more nos for the not-smoking population.

Question 3-c: With your partner, discuss and document:

1) What misconceptions do these students have?

2) How are their misconceptions related? That is, what prerequisite knowledge do these students appear to be lacking?


3) Your response to this pair of students as their teacher. Be specific in your description of your response, writing exactly what you would say and/or drawing anything you would use in your response.

Set 2: [Deterministic misconception; removed for this excerpt]

Set 3: [Unidirectional misconception; removed for this excerpt]

Set 4: [Localist misconception; removed for this excerpt]

Set 5: [Ignoring Data, Using Only Preconceived Notions; removed for this excerpt]

In summary, these are the common misconceptions students have when analyzing categorical data for associations:

• Lack of proportional reasoning
• Deterministic
• Unidirectional
• Localist
• Use of only intuition and ignoring the data
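The first item in this list, lack of proportional reasoning, can be made concrete with the Task 1 (Smoking) table. The sketch below is only an illustration in Python; the names `smoking_table` and `row_conditionals` are hypothetical and not part of the module. It divides each count by its row total, which is the comparison David and Nathan did not carry through.

```python
# Task 1 (Smoking) counts, copied from the table in Activity 3.
smoking_table = {
    "Smoke": {"Bronchial disease": 90, "No bronchial disease": 60},
    "Not smoke": {"Bronchial disease": 60, "No bronchial disease": 40},
}


def row_conditionals(table):
    """Return each row's counts divided by that row's total."""
    result = {}
    for row_label, counts in table.items():
        row_total = sum(counts.values())
        result[row_label] = {col: n / row_total for col, n in counts.items()}
    return result


conditionals = row_conditionals(smoking_table)
for row_label, proportions in conditionals.items():
    # Format each proportion as a whole-number percentage for display.
    print(row_label, {col: f"{p:.0%}" for col, p in proportions.items()})
```

Both rows give 60% with bronchial disease, even though the raw counts (90 versus 60) differ because the groups have different sizes. Comparing proportions within each group also addresses Nathan's concern about unequal group sizes.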

Learning about these common misconceptions that students have regarding categorical association is meant to better prepare you to teach this topic because you can anticipate student issues and incorporate that knowledge into your planning for instruction on the topic.

Other pedagogical points which research has found help students learn categorical association are:

• Presenting all of the data simultaneously rather than one case at a time
• Working with data from meaningful contexts
• Using a small number of data points
• Telling students that inverse association is possible, and it’s still association.
• Including examples of inverse association and no association


Activity 4: Graphing and EDA

Now we turn to the various ways to graph contingency tables, and ways of doing exploratory data analysis (EDA).

Comparative Bar Charts

Consider a generic survey with two questions; each person answers either Pro (in favor), Undecided, or Con (against) to question 1 and question 2 [note: we will be updating this example to make it less generic, more interesting]. Suppose we get the following table:

Table: Two polling questions

           Pro2   Und2   Con2   Totals
Pro1        105     15     30      150
Und1         42      6     12       60
Con1         63      9     18       90
Totals      210     30     60      300

In a previous activity, we found that this table shows perfect independence. Now let’s consider how to graph it. One graph that many people think to start with is bars showing each cell’s count. This is called a side-by-side bar graph or a comparative bar chart. There are two ways to group the bars:

[Figures 4-A and 4-B: Two Survey Questions: Comparative Bar Chart. Vertical axis: Frequency (0 to 120). Left panel: bars for Pro1, Und1, and Con1, grouped by Pro2, Und2, and Con2. Right panel: bars for Pro2, Und2, and Con2, grouped by Pro1, Und1, and Con1.]

Both of these suffer the same problem, though: since there were far more people in the Pro1 category than Und1 or Con1 (150 versus 60 or 90), we expect to see taller bars for Pro1 regardless of any association between the variables. With these graphs of frequency, we can’t compare bar heights without also mentally accounting for the number of people in each category.

Question 4-a: Give at least three ways you can tell that these graphs above aren’t showing relative frequencies.

Question 4-b: If a student said “the Con1 bar for Pro2 is much taller than the Con1 bar for Und2, so the variables are associated because Con1 response varies with their position on question 2”, how would you respond?


Instead of graphing the simple frequencies, we must compute relative frequencies based on the number of people in each row (or each column) to get graphs we can easily interpret. That is, we need to graph the conditional relative frequencies instead of the joint relative frequencies. We can use either row- or column-based conditionals, and we can group them in two ways, which gives us four possible plots. First, consider these two:

[Figures 4-C and 4-D: Two Survey Questions: Comparative Bar Chart. Vertical axis: Relative Frequency (0% to 80%). Left panel: bars for Pro1, Und1, and Con1, grouped by Pro2, Und2, and Con2. Right panel: bars for Pro2, Und2, and Con2, grouped by Pro1, Und1, and Con1.]

Question 4-c: Are these showing us the conditional relative frequencies of Question 1 given Question 2, or vice versa? Explain. Hint: what adds to 100%?

The graph on the left shows perfect independence between the questions because the chance of Pro2 does not change (70%) regardless of whether the person was Pro1, Und1, or Con1, and similarly for the other positions (Und2, Con2) on question 2. That is, for independence, we are looking for the bars in each group to be the same height in this type of graph. If the bar heights in a group are not the same, we have evidence of statistical association. Notice that this might feel backwards: in non-technical language, people who are associated with each other have something in common (for example, labor unions or professional organizations are often called Associations), but here the association is shown by different bar heights based on how people answered question 1 in the poll.

The graph on the right shows perfect independence between the questions because the relative frequencies of answers to question 2 do not change from one group of bars to another. That is, the relative frequencies do not depend on what someone answered on question 1. Here we are looking at the height of the leftmost bar in each group and noting they are all the same, and similarly the middle bars in each group all have the same height, and the rightmost bars too. It is tempting to say that the shape of each group is the same, but the “shape” of the bars is a word we should not use for categorical variables since their ordering is arbitrary.
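The bar heights described in the last two paragraphs can be reproduced from the polling table. The following is a minimal sketch in Python, using hypothetical names such as `poll_counts` and `q2_given_q1`. It computes the conditional relative frequencies of question 2 given question 1 and checks that every row has the same distribution.

```python
# Counts from "Table: Two polling questions"
# (rows: answers to question 1, columns: answers to question 2).
columns = ["Pro2", "Und2", "Con2"]
poll_counts = {
    "Pro1": [105, 15, 30],
    "Und1": [42, 6, 12],
    "Con1": [63, 9, 18],
}

# Conditional relative frequencies of question 2 given question 1:
# divide each count by its row total, so each row adds to 100%.
q2_given_q1 = {
    row: [count / sum(counts) for count in counts]
    for row, counts in poll_counts.items()
}

for row, freqs in q2_given_q1.items():
    print(row, dict(zip(columns, (f"{f:.0%}" for f in freqs))))

# Independence check: every row has the same conditional distribution
# (compared with a small tolerance because these are floating-point values).
rows = list(q2_given_q1.values())
same_everywhere = all(
    abs(a - b) < 1e-9 for other in rows[1:] for a, b in zip(rows[0], other)
)
print("Same distribution in every row:", same_everywhere)
```

Each row comes out as 70%, 10%, and 20%, which is why the bars within each group match on the left graph and the groups look alike on the right graph.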


Segmented Bar Graphs

Another option for graphing the data is called a 100% Stacked Bar Graph or Segmented Bar Graph. Each thing we condition on is represented as a bar that goes to 100%, and is divided into regions proportional to the probabilities of the outcomes. Using the perfectly independent data from before, we have these graphs:

[Figures 4-E and 4-F: Two Survey Questions: Segmented Bar Chart. Vertical axis: 0% to 100%. Left panel: one bar each for Pro2, Und2, and Con2, divided into Pro1, Und1, and Con1 segments. Right panel: one bar each for Pro1, Und1, and Con1, divided into Pro2, Und2, and Con2 segments.]

On the left plot, the probability of someone being Pro1 given that they are Pro2 is about 50%; the probability of Con1 given Pro2 is about 30%, which is read down from the top of the graph as 100% - 70%.
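The readings from the left plot can be checked against the polling table. The sketch below is in Python; the helper name `segment_tops` is hypothetical, and the stacking order (Pro1 at the bottom, then Und1, then Con1) is an assumption based on the figure legend. It computes where each segment of the Pro2 bar ends.

```python
# Counts from "Table: Two polling questions"
# (rows: answers to question 1, columns: answers to question 2).
poll_counts = {
    "Pro1": {"Pro2": 105, "Und2": 15, "Con2": 30},
    "Und1": {"Pro2": 42, "Und2": 6, "Con2": 12},
    "Con1": {"Pro2": 63, "Und2": 9, "Con2": 18},
}


def segment_tops(table, column):
    """Cumulative heights of the stacked segments for one bar (one column)."""
    column_total = sum(row[column] for row in table.values())
    tops = {}
    running = 0
    # Assumed stacking order, bottom to top: Pro1, Und1, Con1.
    for row_label, row in table.items():
        running += row[column]
        tops[row_label] = running / column_total
    return tops


pro2_bar = segment_tops(poll_counts, "Pro2")
print({label: f"{top:.0%}" for label, top in pro2_bar.items()})
```

The Pro1 segment ends at 50% and the Und1 segment ends at 70%, so the Con1 segment covers the remaining 30%, matching the "100% - 70%" reading from the top of the bar.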

Question 4-d: What is the probability of Pro2 given Pro1 on the graph on the right? __

Question 4-e: What is the probability of Con2 given Pro1 on the graph on the right? __

As above, association is evident when the segments have different heights. Just one mismatch is enough to be evidence of association. We don’t need mismatch on every bar height. But again, if we find only small mismatches, they would count as only weak evidence of association (depending on the sample size).

Question 4-f: Write a sentence about how the graph on the left shows evidence of association or of independence.

Question 4-h: Looking back at comparative bar charts and segmented bar charts, which type seems better for displaying evidence of association?


Activity 7: Curriculum Standards

CCSS-M

Below are two standards from the CCSS-M (www.corestandards.org) that focus on association of categorical data.

Grade 8: 8.SP.4. Understand that patterns of association can also be seen in bivariate categorical data by displaying frequencies and relative frequencies in a two-way table. Construct and interpret a two-way table summarizing data on two categorical variables collected from the same subjects. Use relative frequencies calculated for rows or columns to describe possible association between the two variables. For example, collect data from students in your class on whether or not they have a curfew on school nights and whether or not they have assigned chores at home. Is there evidence that those who have a curfew also tend to have chores?

Secondary: HSS-ID.B.5. Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies). Recognize possible associations and trends in the data.

Question 7-a: What prerequisite knowledge do students need in order to be prepared to learn the content described in standard 8.SP.4?

Question 7-b: Both of these standards explicitly mention tables, but neither mentions graphs. Should teachers use graphs to help students understand association of categorical variables? If so, why and in what ways? If not, why not?


The following standard is also in the CCSS-M, but is listed as a probability standard:

Secondary: S-CP.4. Construct and interpret two-way frequency tables of data when two categories are associated with each object being classified. Use the two-way table as a sample space to decide if events are independent and to approximate conditional probabilities.

Question 7-c: What are similarities between the probability standard and the categorical association standards?

Question 7-d: What are differences between the probability standard and the categorical association standards?

Did you know there are some great opportunities for cross-curricular teaching with the topic of categorical data analysis? For example, the Next Generation Science Standards (http://www.nextgenscience.org/) include the following standards for eighth grade: Use data in a two-way table as evidence to support an explanation of how environmental and genetic factors affect the growth of organisms; that different local environmental conditions impact growth in organisms; of how social behavior and group interactions benefit organisms’ abilities to survive and reproduce (MS-LS1 and LS2). Geography (http://education.nationalgeographic.com/education/national-geography-standards/?ar_a=1), economics (http://www.councilforeconed.org/resource/voluntary-national-content-standards-in-economics/), and social studies (http://www.nchs.ucla.edu/history-standards/common-core-standards-1-1) are other subject areas whose standards also overlap with statistics.


AP Statistics

The AP Statistics curriculum (http://media.collegeboard.com/digitalServices/pdf/ap/ap-statistics-course-description.pdf) also includes standards relevant to association of categorical variables. They include:

I.E. Exploring categorical data
    1. Frequency tables and bar charts
    2. Marginal and joint frequencies for two-way tables
    3. Conditional relative frequencies and association
    4. Comparing distributions using bar charts

III.D.8: distribution for chi-square
IV.B.6: Chi-square test for independence

The I.E. standards overlap with those seen from the CCSS-M, but the addition of standards relevant to the Chi-square distribution and test for independence shows an extension into procedures for those students in AP Statistics. While it is dependent upon ideas developed previously, the Chi-square test for independence requires a shift on the part of the student to think differently about data and its variability. Its measurement of how much the observed data differ from what one would expect if the variables were independent inherently allows for the data to vary from the expected values due to natural sampling variability. The important question is: do the data vary so much from the expected values (calculated under the assumption of independence) that we don’t think the variables are independent? This is a new and important concept that statistical inference emphasizes.

Question 7-e: Name at least two ways the Chi-square test for independence procedure depends upon ideas developed when exploring categorical data.

i)

ii)

iii)
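The comparison of observed data with "what one would expect if the variables were independent" can be sketched numerically. This module leaves the mechanics of the test to introductory textbooks, so the Python below is only an illustration under stated assumptions. It uses the usual textbook formulas (expected count = row total × column total / grand total, and the sum of (observed - expected)² / expected), applied to the polling table from Activity 4. The variable names are hypothetical.

```python
# Observed counts from "Table: Two polling questions" in Activity 4
# (rows: Pro1, Und1, Con1; columns: Pro2, Und2, Con2).
observed = [
    [105, 15, 30],  # Pro1
    [42, 6, 12],    # Und1
    [63, 9, 18],    # Con1
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence:
# (row total * column total) / grand total, for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(chi_square)
```

For this perfectly independent table the expected counts equal the observed counts, so the statistic is 0. Real samples rarely match exactly, which is the sampling variability the paragraph before Question 7-e describes.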
