Ch. 3 Simple Inference for Categorical Data

Christopher Kinson

Department of University of Illinois at Urbana-Champaign

What is Categorical Data?

- Any value with a meaning other than numeric quantities
- Can be words and/or numbers so long as the definition is meaningful
- May be stored as counts (frequencies) or proportions (risks) in
  - standard data set
  - two-way tables for two categorical variables
  - multi-way tables for multiple categorical variables

Data description includes

- data visualization - proc sgplot
- numerical summaries - proc freq
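A minimal sketch of the two description tools named above, assuming a data set with one row per subject. The names dataName and varName are placeholders, in the same spirit as the generic names used elsewhere in these notes, and not real data set or variable names.

```sas
/* dataName and varName are placeholders, not real names. */
proc sgplot data=dataName;
    vbar varName;        /* one bar per category; bar height is the frequency */
run;

proc freq data=dataName;
    tables varName;      /* one-way table of counts and percentages */
run;
```

The vbar statement draws the frequencies that the one-way tables statement reports numerically, so the two steps describe the same variable in two forms.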

Inference includes

- hypothesis testing - proc freq

How is Categorical Data Stored?

Most data sets are not made up entirely of categorical variables; they may contain only a few. Those data sets often appear as columns of variables, with each row corresponding to a new observation or subject - the standard layout we’ve used before.

For this chapter, we will also analyze categorical data presented in tables. These tables have various names: contingency, frequency, cross-tabulated, cross-classified, or n-way tables, where n is the number of categorical variables.

A 2 × 2 table

| Row Variable | Column Variable: Category 1 | Column Variable: Category 2 | Total |
|---|---|---|---|
| Category 1 | n11 | n12 | n1+ |
| Category 2 | n21 | n22 | n2+ |
| Total | n+1 | n+2 | N |

Tests of Association

Hypothesis testing for a two-way table includes

- H0: there is no association between the row variable and column variable
- HA: there is an association between the two variables

To test the null, we compare the observed and expected counts within the table.

- observed count: the frequency reported in the table, i.e. the raw count
- expected count: (row total) · (column total) / (overall total)
- a small p-value is strong evidence of association
  - but we don’t know the characteristics of the association
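As a worked illustration of the expected-count formula, the calculation below uses the counts from the Heart Attack Data table that appears later in this chapter (placebo: 189 yes, 10845 no; aspirin: 104 yes, 10933 no). The symbol E and its subscripts are notation introduced only for this example, and the values are rounded to two decimals.

```latex
\begin{align*}
N &= 189 + 10845 + 104 + 10933 = 22071 \\
E_{\text{placebo, yes}} &= \frac{11034 \times 293}{22071} \approx 146.48 \\
E_{\text{placebo, no}} &= \frac{11034 \times 21778}{22071} \approx 10887.52 \\
E_{\text{aspirin, yes}} &= \frac{11037 \times 293}{22071} \approx 146.52 \\
E_{\text{aspirin, no}} &= \frac{11037 \times 21778}{22071} \approx 10890.48
\end{align*}
```

Here 11034 and 11037 are the row totals and 293 and 21778 are the column totals. Every observed count differs from its expected count by about 42.52 in absolute value (for instance 189 − 146.48), and it is these deviations that the tests of association summarize.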

Hypothesis tests based on deviations from expected counts

Hypothesis Testing: Pearson Chi-Square

- rows $i = 1, \ldots, R$ and columns $j = 1, \ldots, C$
- overall total $N = \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}$
- observed counts $n_{ij}$
- expected counts

$$\hat{\mu}_{ij} = \frac{n_{i+} \, n_{+j}}{N}$$

- Pearson’s chi-square

$$X^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$$

- degrees of freedom $df = (R - 1)(C - 1)$
- larger values of $X^2$ imply more evidence against the null hypothesis

Hypothesis Testing: Likelihood Ratio Chi-Square

- likelihood ratio chi-square statistic

$$G^2 = 2 \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij} \log\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right)$$

- degrees of freedom $df = (R - 1)(C - 1)$
- larger values of $G^2$ imply more evidence against the null hypothesis
- alternative to Pearson’s $X^2$

Measures of Association: Phi Coefficient

- based on $X^2$ statistic and overall total $N$
  - take the square root of

$$\phi^2 = \frac{X^2}{N}$$

- small values mean weak association
- large values mean strong association
- for a 2 × 2 contingency table:
  - phi coefficient has the same interpretation as the Pearson correlation

$$\phi = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{1+} \, n_{2+} \, n_{+1} \, n_{+2}}}$$

  - values lie in [−1, +1]
  - values near −1 mean strong negative association
  - values near +1 mean strong positive association
  - values near 0 mean weak or no association

Measures of Association: Contingency Coefficient

- based on $X^2$ statistic and overall total $N$

$$cc = \sqrt{\frac{X^2}{N + X^2}}$$

- values always less than 1
- small values mean weak association
- large values mean strong association
- useful for a square contingency table larger than 2 × 2

Measures of Association: Cramer’s V Coefficient

- based on phi coefficient

$$V = \sqrt{\frac{\phi^2}{\min(R - 1, C - 1)}}$$

- $R$ is the total number of rows
- $C$ is the total number of columns
- values range over [0, 1]
- small values mean weak association
- large values mean strong association
- for a 2 × 2 contingency table:
  - Cramer’s V is the same as the phi coefficient

Using proc freq

- proc freq can produce the aforementioned tests for association and measures of association
- tables statement sets up the frequency table containing counts, overall percentages, row percentages, and column percentages by default
  - chisq - runs chi-square and other tests of association & shows measures of association
  - expected - shows the expected counts for each cell
  - deviation - shows the residual (observed − expected) for each cell
  - cellchi2 - shows the chi-square contribution for each cell
  - norow - hides the row percentages
  - nocol - hides the column percentages
  - nopercent - hides the overall percentages
  - noprint - does not print the table

Using proc freq (cont.)

When the data exist as a contingency table, the weight statement tells SAS that the values of the named variable are the observed cell counts.

    proc freq data=dataName;
        tables rowVarName*columnVarName / chisq expected;
        *weight n;   /* remove the leading asterisk when n holds the cell counts */
    run;

Oral Contraceptives Data

Women patients in several hospitals were asked whether they used oral contraceptives (such as birth control pills). Each case was a married woman who suffered from blood clots (idiopathic thromboembolism) over a 3-year period. These cases were matched to controls - women without blood clots who were discharged alive from the same hospital in the same time interval. The question of interest is whether there is an association between suffering from blood clots and oral contraceptive usage. See Sartwell et al. (1969), “Thromboembolism and oral contraceptives: An epidemiological case-control study,” for more details.

| Oral Contraceptive Usage: Cases | Controls: Used | Controls: Not Used |
|---|---|---|
| Used | 10 | 57 |
| Not Used | 13 | 95 |

Oral Contraceptives Data (cont.)

Manually reading it into SAS!

    data pill;
        input caseuse $ controluse $ n;   /* case usage, control usage, number of pairs */
        cards;
    Y Y 10
    Y N 57
    N Y 13
    N N 95
    ;
    run;

Heart Attack Data

A 5-year randomized study was done on male physicians to study aspirin’s effect on cardiovascular disease. The physicians took either one aspirin or one placebo, and they did not know the type of pill they took. The question of interest is whether the placebo group and the aspirin group experienced heart attacks (myocardial infarctions) at similar rates. See “Preliminary Report: Findings from the Aspirin Component of the Ongoing Physicians’ Health Study” (1988) for more information.

| Drug Group | Myocardial Infarction: Yes | Myocardial Infarction: No |
|---|---|---|
| Placebo | 189 | 10845 |
| Aspirin | 104 | 10933 |

Heart Attack Data (cont.)

Manually reading it into SAS!

    data heart;
        input group $ attack $ n;   /* drug group, heart attack (yes/no), cell count */
        datalines;
    placebo yes 189
    placebo no 10845
    aspirin yes 104
    aspirin no 10933
    ;
    run;

Car Accidents Data

The data set is a subset of the National Automotive Sampling System’s (NASS) Crashworthiness Data System (CDS), which contains a stratified random sample of nationwide police-reported crashes between 1997 and 2002. CDS data focus on passenger vehicle crashes and are used to investigate injury mechanisms. One interesting question is whether the data debunk the horrible stereotype that women are inadequate drivers. The data set contains 26218 observations, and we will retain 6 of the original 15 variables.

    filename adata url "https://tinyurl.com/y8hujqvk";

    data accidentDB;
        /* tab-delimited file; firstobs=2 skips the header row */
        infile adata dsd dlm='09'x truncover firstobs=2;
        input weight dead $ airbag $ seatbelt $ frontal sex $ ageOFocc
              yearacc yearVeh abcat $ occRole $ deploy injSeverity;
        keep dead weight sex occrole yearacc ageofocc;
    run;

    proc print data=accidentDB (obs=50);
    run;

Steam Video Game Data

This is a subset of the steam-200k data set on Kaggle with an added ESRB rating column. Steam is a very popular online gaming hub with hundreds of thousands of users playing millions of hours of video games. The video games featured on Steam are of various genres and can be found across multiple consoles such as PlayStation, Xbox, Wii, and many others. The data are a random selection of 100 users and the game each spent the most hours playing. The variables include user ID, game title, hours played, ESRB rating, genre, and a binary variable indicating whether (1) or not (0) the user played more than 40 hours of that game. The ESRB rating does not apply to the online gaming experience, but I use it anyway. A question of interest is whether there is an association between the ESRB rating and the number of hours played.

Steam Video Game Data (cont.)

Reading it into SAS!

    filename vgdata url "https://tinyurl.com/ya2cvtt6";

    data game;
        infile vgdata dsd dlm='09'x truncover firstobs=2;
        /* character variables read with a bare $ default to length 8 */
        input userID title $ hoursplayed Rating $ Genre $ over40hours;
    run;

    proc print data=game;
    run;

Weight Perception Data

These data come from a national youth survey consisting of a nationally representative sample of young people ages 14 to 20 as of December 31, 1999, who self-reported their weight perception in response to the prompt “How would you describe your weight?” Variables include age, height (in inches), weight (in pounds), sex, and a categorical response about weight perception. Do upperclassmen feel better about their weight than underclassmen? Do young women tend to feel more strongly about their weight than young men?

    filename wdata url "https://tinyurl.com/yc7dnv8p";

    data teenweightDB;
        infile wdata dsd dlm='09'x truncover firstobs=2;
        input gender $ age height weight weightperception $16.;
    run;

    proc print data=teenweightDB;
    run;

Exercise: Car Accidents Data

1. Create bar graphs of the variables sex and occupant’s role. Make sure the title of the graph says “Bar Graph of Occupant’s Role”.
2. Report the frequencies and expected values for the two variables.
3. Run a Pearson $X^2$ test of association. What conclusions do you draw from the results?
4. Use the measures of association to describe the strength of association between sex and occupant’s role.

Hypothesis Testing: Risk Difference

For 2 × 2 tables,

- risks are binomial proportions

  - row percentage from proc freq by default
  - comparing proportions of row 1 to row 2 in the table
- We can test whether the difference in risks is significant
  - H0: risk1 − risk2 = 0
  - HA: risk1 − risk2 ≠ 0
- We can find confidence intervals for the difference in risks
- Asymptotically normal under the null
- If the individual risks are very close to 0, then
  - The risk difference test results can be misleading
  - Compute the odds ratio for an alternative interpretation
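To make the risk difference concrete, the lines below work through the Heart Attack Data table (placebo: 189 of 11034 had a heart attack; aspirin: 104 of 11037). The large-sample Wald interval used here, and the symbols $\hat{p}_1$, $\hat{p}_2$, and SE, are assumptions introduced for this illustration rather than formulas stated in these notes. Values are rounded.

```latex
\begin{align*}
\hat{p}_1 &= \frac{189}{11034} \approx 0.0171, \qquad
\hat{p}_2 = \frac{104}{11037} \approx 0.0094 \\
\hat{p}_1 - \hat{p}_2 &\approx 0.0077 \\
\mathrm{SE} &= \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{11034} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{11037}} \approx 0.0015 \\
95\%\ \text{CI} &\approx 0.0077 \pm 1.96 \times 0.0015 \approx (0.005,\ 0.011)
\end{align*}
```

The interval excludes 0 and agrees with the 0.008 ± 0.003 interval quoted in the Heart Attack Data exercise. Both risks are close to 0, which is the situation where the odds ratio offers a useful second view.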

    proc freq data=dataName;
        tables rowVarName*columnVarName / riskdiff;
        *weight n;   /* remove the leading asterisk when n holds the cell counts */
    run;

Measure of Association: Odds Ratio

- For 2 × 2 tables, the sample odds ratio is

$$OR = \frac{odds_1}{odds_2} = \frac{n_{11} / n_{12}}{n_{21} / n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$$

- OR = 1 - the row and column variables are independent (no association)
- OR far from 1 (much greater than 1 or close to 0) - strong association
- OR > 1 - subjects in row 1 are more likely to have a success than subjects in row 2
- OR < 1 - subjects in row 1 are less likely to have a success than subjects in row 2
- Confidence intervals are based on log(OR)

Measure of Association: Odds Ratio (cont.)

For two binary variables, the sample odds ratio OR = 1.25 has the following equivalent interpretations:

1. the odds of success in row 1 are 1.25 times the odds of success in row 2
2. the odds of success in row 2 are 1/1.25 = 0.8 times the odds of success in row 1
3. the odds of success are 25% higher in row 1 than in row 2
   - this 3rd interpretation makes sense when 1 < OR < 2
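As an illustration, the odds ratio for the Heart Attack Data table (placebo row: 189 yes, 10845 no; aspirin row: 104 yes, 10933 no) can be worked out by hand. The standard error of log(OR) shown here is the usual large-sample (Wald) formula, which these notes do not state, so treat it as an assumption of this sketch. Values are rounded.

```latex
\begin{align*}
OR &= \frac{189 \times 10933}{10845 \times 104} \approx 1.83 \\
\log(OR) &\approx 0.605 \\
\mathrm{SE}\bigl(\log OR\bigr) &= \sqrt{\frac{1}{189} + \frac{1}{10845} + \frac{1}{104} + \frac{1}{10933}} \approx 0.123 \\
95\%\ \text{CI for } OR &\approx \exp(0.605 \pm 1.96 \times 0.123) \approx (1.44,\ 2.33)
\end{align*}
```

Read with the first interpretation, the odds of a heart attack in the placebo group are about 1.83 times the odds in the aspirin group. Because 1 < OR < 2, the third interpretation also applies: the odds are about 83% higher for the placebo group.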

    /* To show the odds ratios only */
    proc freq data=dataName;
        tables rowVarName*columnVarName / or(cl=wald);
        *weight n;   /* remove the leading asterisk when n holds the cell counts */
    run;

Hypothesis Testing: Fisher’s Exact

For 2 × 2 contingency tables

- Appropriate for a table containing small expected cell frequencies and/or a small sample size
- In theory, the row totals and column totals are treated as fixed, so the test statistic depends only on n11
- This test is very conservative
  - reject only for really small p-values

For larger tables

- Appropriate when the data are sparse
  - sparse: several cells with a frequency of 0

Hypothesis Testing: Fisher’s Exact (cont.)

    /* for 2 by 2 tables you can use */
    proc freq data=dataName;
        tables rowVarName*columnVarName / chisq;
        *weight n;
    run;

    /* for general sized two-way tables you can use */
    proc freq data=dataName;
        tables rowVarName*columnVarName / exact;
        *weight n;
    run;

Hypothesis Testing: Mantel-Haenszel

A special kind of association test

- Test for linear association (as categories increase/decrease in value)
- Appropriate for two ordinal variables
  - containing ordered categories
  - e.g. age groups listed from youngest to oldest, level of agreement ordered from strongly disagree to strongly agree
- Asymptotically chi-square with 1 degree of freedom
- H0: no linear association
- HA: increases/decreases in one variable are associated with increases/decreases in the other variable

    proc freq data=dataName;
        tables rowVarName*columnVarName / chisq;
        *weight n;
    run;

Hypothesis Testing: McNemar’s

- Appropriate for 2 × 2 tables with a matched pairs design
- Some matched pairs designs include
  - Case-control studies
  - Studies about twins
  - Studies of one group of subjects at two time points
- H0: the two marginal proportions are the same, i.e. n+1 = n1+ and n+2 = n2+
- HA: the two marginal proportions are not the same
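As a sketch of how the test works on matched pairs, take the Oral Contraceptives Data table, where each count is a case-control pair. These notes do not give the test statistic. The form below is the standard large-sample McNemar statistic without continuity correction, which uses only the discordant pairs $n_{12} = 57$ and $n_{21} = 13$ and is compared to a chi-square distribution with 1 degree of freedom.

```latex
\[
X^2_{\text{McNemar}} = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}
= \frac{(57 - 13)^2}{57 + 13} = \frac{1936}{70} \approx 27.66
\]
```

The marginal proportions being compared are 67/175 for cases who used oral contraceptives and 23/175 for controls who did. A value this large on 1 degree of freedom is strong evidence that the two marginal proportions differ.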

    /* agree produces McNemar's test and the kappa coefficient;
       ods select keeps only McNemar's test in the output */
    proc freq data=dataName;
        tables rowVarName*columnVarName / agree;
        weight n;
        ods select McNemarsTest;
    run;

Hypothesis Testing: Exact Tests

- When sample sizes are small but the expected frequencies are not a problem, you can use an exact test for the appropriate test of association
- Some examples of exact tests:
  - chisq - exact Pearson, likelihood ratio, and Mantel-Haenszel chi-square tests
  - pchi - exact Pearson chi-square
  - lrchi - exact likelihood ratio chi-square
  - mhchi - exact Mantel-Haenszel chi-square
  - fisher - Fisher’s exact test
  - or - exact confidence limits for the odds ratio

    proc freq data=dataName;
        tables rowVarName*columnVarName / chisq;
        *weight n;
        exact chisq;   /* exact versions of the chi-square tests */
    run;

Additional Guidance

Using the order=data option in the proc freq statement prints the contingency table with the categories in their order of appearance in the data set rather than in alphabetical order.

- If the first entry in a data set for one variable, say gender, is women and the second unique entry is men, then the contingency table will list women as the first category and men as the second.
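A short sketch of order=data in use, assuming the heart data set created earlier in these notes (variables group, attack, and n). The choice of chisq and expected is only for illustration.

```sas
/* Assumes the heart data set from the Heart Attack Data slides. */
proc freq data=heart order=data;
    tables group*attack / chisq expected;
    weight n;            /* n holds the cell counts */
run;
```

Because the datalines list placebo before aspirin and yes before no, the table keeps placebo as row 1 and yes as column 1. Without order=data, proc freq would sort the categories alphabetically, putting aspirin before placebo and no before yes, which changes which row and column the risk difference and odds ratio refer to.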

In general, if the sample size is too small, exact tests should be used. Some exact results are requested with the exact statement, and others with the exact option of the tables statement. For more details and/or guidance, check any of the following links:

- SAS Procedures by Name and Product
- Chi-Square Tests and Statistics
- UCLA’s Proc freq SAS Annotated Output

Exercise: Oral Contraceptives Data

1. Create a frequency table showing the counts, percentages, and expected counts. Do any cells have large frequencies?
2. Run a test of association and state why you chose this test. Comment on the test results.
3. Give a confidence interval for the difference between the proportion of women with blood clots who use oral contraceptives and the corresponding proportion of women not suffering from blood clots. What does this interval tell us?

Exercise: Weight Perception Data

1. Create a subset of the data with weights larger than 0 and where teens are younger than 19 years old.
2. Create a categorical status variable such that teens younger than 17 are labeled ‘Lowerclassman’ and teens ages 17-18 are labeled ‘Upperclassman’.
3. Obtain the expected counts and the chi-square contributions for each cell of the table of status vs weightperception. Make sure the ordering is in the same direction for both variables.
4. Run the Mantel-Haenszel test for linear association on the table with variables status and weightperception. Comment on the results.
5. Run the Mantel-Haenszel test for linear association on the table with variables gender and weightperception. Interpret the results.
6. Compute the odds ratios and their confidence intervals for the table with gender and weightperception. Interpret the results.

Exercise: Steam Video Game Data

1. Print the data to see if some of your favorite games are in this data set.
2. Make a subset of the data for the games with mature and everyone ratings. Create a new variable that is 1 for mature games and 0 for everyone games.
3. Suppose anyone who plays a video game for more than 40 hours is extreme. Is there an association between the rating of a game and being an extreme gamer?
4. Which test is appropriate for this setting and why?

Exercise: Heart Attack Data

Using the table from above, complete the following questions.

1. Read in the data using SAS.
2. Obtain the expected counts and the chi-square contributions for each cell.
3. Test for association, state why you chose this test, and comment on the test results.
4. Obtain risk estimates to see if the difference is significant. One book found the 95% CI for the difference in heart attack risk between the placebo group and the aspirin group to be 0.008 ± 0.003 or (0.005, 0.011).
5. Compute the odds ratio and its confidence interval. Interpret the results.

Exercise: Car Accidents Data

1. Obtain risk estimates to see if there’s any difference in the occupant’s role for males and females.
2. Interpret the odds ratio and its confidence interval for the 2 × 2 table.
3. Suppose we created a subset by matching each female driver with a male driver and randomly choosing 8000 pairs. We checked whether the accidents resulted in death. Using Table 1 below, what can we conclude from a test of association for this data?

Table 1

| Female drivers: Dead? | Male drivers: Dead | Male drivers: Alive |
|---|---|---|
| Dead | 10 | 271 |
| Alive | 398 | 7321 |

Exercise: Car Accidents Data (cont.)

4. We randomly selected 4000 male and 4000 female occupants and checked whether they died in traffic accidents. These results are in Table 2. Determine the risk difference of female vs male death for the table and comment on the results.

Table 2

| Sex | Dead?: Dead | Dead?: Alive |
|---|---|---|
| Female | 139 | 3861 |
| Male | 193 | 3807 |