Ch. 3 Simple Inference for Categorical Data

Christopher Kinson
Department of Statistics
University of Illinois at Urbana-Champaign

What is Categorical Data?

- Any value with a meaning other than a numeric quantity
- Can be words and/or numbers, so long as the definition is meaningful
- May be stored as counts (frequencies) or proportions (risks) in
  - a standard data set
  - two-way tables for two categorical variables
  - multi-way tables for multiple categorical variables

Data description includes
- data visualization - proc sgplot
- numerical summaries - proc freq

Inference includes
- hypothesis testing - proc freq

How is Categorical Data Stored?

Most data sets are not made up entirely of categorical variables; they may contain only a few. Those data sets often appear as columns of data with rows corresponding to new observations or subjects - the standard we've used before.

For this chapter, we will also analyze categorical data presented in tables. These tables have various names: contingency, frequency, cross-tabulated, cross-classified, or n-way tables, where n is the number of categorical variables.

A 2 × 2 contingency table

                      Column Variable
    Row Variable      Category 1   Category 2   Total
    Category 1        n11          n12          n1+
    Category 2        n21          n22          n2+
    Total             n+1          n+2          N

Tests of Association

Hypothesis testing for a two-way table includes
- H0: there is no association between the row variable and the column variable
- HA: there is an association between the two variables

To test the null, we compare the observed and expected counts within the table.
- observed count: the frequency reported in the table, i.e. the raw count
- expected count: (row total)·(column total) / (overall total)
- a small p-value means strong evidence of association
- but we don't know the characteristics of the association

Hypothesis tests are based on deviations from the expected counts.

Hypothesis Testing: Pearson Chi-Square

- rows i = 1, ..., R and columns j = 1, ..., C
- overall total
  \[ N = \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij} \]
- observed counts n_{ij}
- expected counts
  \[ \hat{\mu}_{ij} = \frac{n_{i+}\, n_{+j}}{N} \]
- Pearson's chi-square statistic
  \[ X^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}} \]
- degrees of freedom df = (R − 1)(C − 1)
- larger values of X^2 imply more evidence against the null hypothesis

Hypothesis Testing: Likelihood Ratio Chi-Square

- likelihood ratio chi-square statistic
  \[ G^2 = 2 \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij} \log\!\left( \frac{n_{ij}}{\hat{\mu}_{ij}} \right) \]
- degrees of freedom df = (R − 1)(C − 1)
- larger values of G^2 imply more evidence against the null hypothesis
- alternative to Pearson's X^2

Measures of Association: Phi Coefficient

- based on the X^2 statistic and the overall total N
- take the square root of
  \[ \phi^2 = \frac{X^2}{N} \]
- small values mean weak association
- large values mean strong association
- for a 2 × 2 contingency table:
  - the phi coefficient has the same interpretation as the Pearson correlation
    \[ \phi = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{1+}\, n_{2+}\, n_{+1}\, n_{+2}}} \]
  - values range over [−1, +1]
  - values near −1 mean strong negative association
  - values near +1 mean strong positive association
  - values near 0 mean weak or no association

Measures of Association: Contingency Coefficient

- based on the X^2 statistic and the overall total N
  \[ cc = \sqrt{\frac{X^2}{N + X^2}} \]
- values are always less than 1
- small values mean weak association
- large values mean strong association
- useful for a square contingency table larger than 2 × 2

Measures of Association: Cramer's V Coefficient

- based on the phi coefficient
  \[ V = \sqrt{\frac{\phi^2}{\min(R - 1,\, C - 1)}} \]
- R is the total number of rows
- C is the total number of columns
- values range over [0, 1]
- small values mean weak association
- large values mean strong association
- for a 2 × 2 contingency table:
  - Cramer's V is the same as the phi coefficient

Using proc freq

- proc freq can produce the aforementioned tests of association and measures of association
- the tables statement sets up the frequency table containing counts, overall percentages, row percentages, and column percentages by default
  - chisq - runs chi-square and other tests of association and shows measures of association
  - expected - shows the expected count for each cell
  - deviation - shows the residual (observed − expected) for each cell
  - cellchi2 - shows the chi-square contribution of each cell
  - norow - hides the row percentages
  - nocol - hides the column percentages
  - nopercent - hides the overall percentages
  - noprint - does not print the table

Using proc freq (cont.)

When the data exist as a contingency table, the weight statement is used to notify SAS that the values represent observed counts.

    proc freq data=dataName;
      tables rowVarName*columnVarName / chisq expected;
      *weight n;  /* remove the leading asterisk when n holds the cell counts */
    run;

Oral Contraceptives Data

Women patients in several hospitals were asked whether they used oral contraceptives (such as birth control pills). Each case was a married woman who suffered from blood clots (idiopathic thromboembolism) over a 3-year period. These cases were matched to controls - women without blood clots who were discharged alive from the same hospital in the same time interval. What we care about is whether there is an association between suffering from blood clots and oral contraceptive usage. See Sartwell et al. (1969), "Thromboembolism and oral contraceptives: An epidemiological case-control study," for more details.

                                        Oral Contraceptive Usage: Controls
    Oral Contraceptive Usage: Cases     Used     Not Used
    Used                                10       57
    Not Used                            13       95

Oral Contraceptives Data (cont.)

Manually reading it into SAS!

    data pill;
      input caseuse $ controluse $ n;
      cards;
    Y Y 10
    Y N 57
    N Y 13
    N N 95
    ;

Heart Attack Data

A 5-year randomized study was done on male physicians to study aspirin's effect on cardiovascular disease.
The physicians took either one aspirin or one placebo, and they did not know the type of pill they took. What we care about is whether the placebo group and the aspirin group experience heart attacks (myocardial infarctions) similarly. See "Preliminary Report: Findings from the Aspirin Component of the Ongoing Physicians' Health Study" (1988) for more information.

                    Myocardial Infarction
    Drug Group      Yes      No
    Placebo         189      10845
    Aspirin         104      10933

Heart Attack Data (cont.)

Manually reading it into SAS!

    data heart;
      input group $ attack $ n;
      datalines;
    placebo yes 189
    placebo no 10845
    aspirin yes 104
    aspirin no 10933
    ;

Car Accidents Data

The data set is a subset of the National Automotive Sampling System's (NASS) Crashworthiness Data System (CDS), which contains a stratified random sample of nationwide police-reported crashes between the years 1997 and 2002. CDS data focus on passenger vehicle crashes and are used to investigate injury mechanisms. Something interesting to investigate is the debunking of the horrible stereotype that women are inadequate drivers. The data contains 26218 observations and will retain 6 of the original 15 variables.

    filename adata url "https://tinyurl.com/y8hujqvk";
    data accidentDB;
      infile adata dsd dlm='09'x truncover firstobs=2;
      input weight dead $ airbag $ seatbelt $ frontal sex $ ageOFocc
            yearacc yearVeh abcat $ occRole $ deploy injSeverity;
      keep dead weight sex occrole yearacc ageofocc;
    run;
    proc print data=accidentDB (obs=50);
    run;

Steam Video Game Data

This is a subset of the steam-200k data set on Kaggle with an added ESRB rating column. Steam is a very popular online gaming hub with hundreds of thousands of users playing millions of hours of video games. The video games featured on Steam are of various genres and can be found across multiple consoles such as PlayStation, Xbox, Wii, and many others. The data is a random selection of 100 users and the game each spent the most hours playing.
The variables in the data include user ID, game title, hours played, ESRB rating, genre, and a binary variable indicating whether (1) or not (0) the user played more than 40 hours of that game. The ESRB rating does not apply to the online gaming experience, but I use it anyway. A question of interest is whether there is an association between the ESRB rating and the number of hours played.

Steam Video Game Data (cont.)

Reading it into SAS!

    filename vgdata url "https://tinyurl.com/ya2cvtt6";
    data game;
      infile vgdata dsd dlm='09'x truncover firstobs=2;
      input userID title $ hoursplayed Rating $ Genre $ over40hours;
    run;
    proc print data=game;
    run;

Weight Perception Data

This data comes from a national youth survey consisting of a nationally representative sample of young people, ages 14 to 20 years old as of December 31, 1999, who self-report their weight perception in response to the prompt "How would you describe your weight?" Variables include age, height (in inches), weight (in pounds), sex, and a categorical response about weight perception. Do upperclassmen feel better about their weight than underclassmen? Do young women tend to feel more strongly about their weight than young men?

    filename wdata url "https://tinyurl.com/yc7dnv8p";
    data teenweightDB;
      infile wdata dsd dlm='09'x truncover firstobs=2;
      input gender $ age height weight weightperception $16.;
    run;
    proc print data=teenweightDB;
    run;

Exercise: Car Accidents Data

1. Create bar graphs of the variables sex and occupant's role. Make sure the title of the graph says: "Bar Graph of Occupant's Role".
2. Report the frequencies and expected values for the two variables.
3. Run a Pearson X^2 test of association. What conclusions do you draw from the results?
4. Use the measures of association to describe the strength of association between sex and occupant's role.
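
One possible way to set up the exercise is sketched below. It assumes the accidentDB data set from the Car Accidents Data slide has already been created and that each row is a single occupant, so no weight statement is used (the weight variable in that data set is deliberately left out of this sketch). Drawing a single clustered bar graph with group= is an illustrative choice here, not something the slides prescribe.

```sas
/* Step 1: bar graph of occupant's role, with bars split by sex.
   Assumes accidentDB was created as on the Car Accidents Data slide. */
title "Bar Graph of Occupant's Role";
proc sgplot data=accidentDB;
  vbar occRole / group=sex groupdisplay=cluster;
run;
title;  /* clear the title so it does not carry over to later output */

/* Steps 2-4: observed and expected counts, chi-square tests, and the
   measures of association that the chisq option reports
   (phi, contingency coefficient, Cramer's V). */
proc freq data=accidentDB;
  tables sex*occRole / chisq expected;
run;
```

The interpretation of the output (steps 3 and 4) is left to the reader; the sketch only produces the tables the exercise asks about.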

Hypothesis Testing: Risk Difference

For 2 × 2 tables,
- risks are binomial proportions
  - row percentage from proc freq by default
  - comparing proportions of row 1 to row 2 in the table
- We can test whether the difference in risks is significant
  - H0: risk1 − risk2 = 0
  - HA: risk1 − risk2 ≠ 0
- We
