1 Analysis of Categorical Data

The techniques presented in previous sections cover the analysis of numerical data and variables. So far it has been left unanswered how to analyze categorical variables and data.

1.1 Two-way tables

How should data for categorical variables be presented? Recall that a single categorical variable is described with a relative frequency table. When we have data on two categorical variables and want to show how the two variables depend on each other, we use a two-way table.

Two-Way Tables: Data resulting from observations made on two categorical variables can be easily summarized in a two-way table.

Example: Suppose we are interested in the rate of sprouted seeds for three different kinds of water (rainwater, muddy water, tap water). The two categorical variables are sprouted (yes or no) and water type (rainwater, muddy water, or tap water). Suppose each type of water was used to water 100 seeds, and afterwards it was recorded how many of those seeds sprouted. The result can be presented in the following table:

                  sprouted
                yes     no    total
rainwater        64     36      100
muddy water      74     26      100
tap water        60     40      100
Total           198    102      300

A natural question to ask at this point is whether the choice of water has an effect on the probability that a seed sprouts.

Multiple Comparison: The question in the example could be answered by comparing each type of water with each of the others, i.e. one would have to make the following comparisons:

muddy - rain
muddy - tap
rain - tap

But it would be better to first find out whether there is any difference at all between the probabilities. To do so we introduce the χ² test for homogeneity.

1.2 χ² Test of Homogeneity and Independence in a Two-Way Table

In this section two different situations and questions concerning two categorical variables will be covered. Both will lead to the same test.
So, let us first introduce the two different types of questions and then introduce the test routine to answer them.

1. Comparing the distribution of a categorical variable in two or more populations.

We will be looking at two or more samples from different populations and test whether the distributions in all populations are the same. The null hypothesis to be tested in this kind of problem is that the distributions are homogeneous, i.e. they are equal for all populations. For the sprouting example:

H0: The probabilities for sprouting and not sprouting are the same for all types of water.
Ha: The probabilities for sprouting and not sprouting depend on the type of water.

2. Testing two categorical variables for independence.

When you study data that involve two variables, one important consideration is the relationship between the two variables. Does the proportion of measurements in the various categories of factor 1 depend on which category of factor 2 is being observed?

Example: A survey was conducted to evaluate the effectiveness of a new flu vaccine that had been administered in a small community. The vaccine consists of a two-shot sequence given within two weeks. A survey of 1000 residents the following spring provided the following information:

           No vaccine   One shot   Two shots   Total
Flu                24          9          13      46
No flu            289        100         565     954
Total             313        109         578    1000

The question to be answered is whether the flu shot had an impact on the incidence of the flu. Equivalently, we could ask whether the incidence of the flu is independent of the vaccination status. To answer this question, a χ² test can be conducted, testing the following hypotheses:

H0: There is no relationship between treatment and incidence of flu.
Ha: The incidence of flu depends on the amount of flu treatment.

The two questions lead to exactly the same χ² test.

The χ² distribution: The χ² distribution is neither a normal nor a t-distribution. Table D gives upper tail areas for different degrees of freedom.
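When Table D is not at hand, an upper tail area can also be computed numerically. The following is a minimal sketch in Python using only the standard library. The function name chi2_upper_tail is hypothetical (it does not come from these notes), and the code assumes integer degrees of freedom. It starts from the closed forms for 1 and 2 df and then applies the standard recurrence for the regularized upper incomplete gamma function to step up 2 df at a time.

```python
import math


def chi2_upper_tail(x, df):
    """Upper tail area P(chi2 > x) for a chi-square distribution.

    Assumes df is a positive integer (hypothetical helper, not from the notes).
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0

    half = x / 2.0
    if df % 2 == 0:
        # Closed form for df = 2: P(chi2 > x) = exp(-x/2)
        tail, k = math.exp(-half), 2
    else:
        # Closed form for df = 1: P(chi2 > x) = erfc(sqrt(x/2))
        tail, k = math.erfc(math.sqrt(half)), 1

    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1),
    # applied with a = k/2 to move from k df to k + 2 df.
    while k < df:
        a = k / 2.0
        tail += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        k += 2
    return tail


if __name__ == "__main__":
    # 5.991 is the usual tabulated 5% critical value for 2 df.
    print(round(chi2_upper_tail(5.991, 2), 4))
```

For large test statistics the tail area becomes very small, which is why a printed table can only report a bound such as "less than 0.0005".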
The χ² Test for Homogeneity and Independence

Given are one categorical variable with R categories and a second categorical variable with C categories.

Hypotheses: H0: The two categorical variables are independent. Ha: H0 is not true.

Assumption: The sample size is large. The sample size is considered large enough as long as every expected count is at least 1 and not more than 20% of the expected counts are less than 5.

Test statistic: Compute for every cell of the two-way table the expected frequency

$$E = \text{expected frequency} = \frac{(\text{row total})(\text{column total})}{\text{sample size}}$$

and then

$$\chi^2_0 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

where O is the observed count.

P-value: P(χ² > χ²₀), the upper tail area for a χ² distribution with (R − 1)(C − 1) df, found in Appendix Table D.

Decision: As usual.

Context: As usual.

Continue Flu Example:

1. Hypotheses: H0: The flu is independent of the vaccine status, versus Ha: H0 is not true. Let us perform this test at a significance level of α = 0.05.

2. Assumptions: The assumptions are easily met, since all expected counts are greater than 5 (the smallest is 5.01, see step 3).

3. Test statistic: The two-way table gives the observed frequencies O. Next calculate the expected count E for each cell:

Flu / no vaccine:     (46 · 313)/1000 = 14.40
Flu / one shot:       (46 · 109)/1000 = 5.01
Flu / two shots:      (46 · 578)/1000 = 26.59
No flu / no vaccine:  (954 · 313)/1000 = 298.60
No flu / one shot:    (954 · 109)/1000 = 103.99
No flu / two shots:   (954 · 578)/1000 = 551.41

Put the expected cell counts into the table (in parentheses after each observed count):

           No vaccine      One shot        Two shots       Total
Flu        24 (14.40)      9 (5.01)        13 (26.59)         46
No flu     289 (298.60)    100 (103.99)    565 (551.41)      954
Total      313             109             578              1000

Now find (O − E)²/E for each cell and add these fractions:

χ² = 6.404 + 3.169 + 6.944 + 0.309 + 0.153 + 0.335 ≈ 17.313, with df = (3 − 1)(2 − 1) = 2.

4. P-value: For df = 2, Table D gives P-value < 0.0005.

5. Decision: Since the P-value is less than α, reject the null hypothesis.

6. Context: At a significance level of 5% the data provide sufficient evidence that the probability of getting the flu is not the same for all three vaccination groups.
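The calculation in the flu example can be reproduced with a short script. This is only a sketch, written in plain Python without any statistics library. The names chi2_two_way and flu_table are hypothetical and introduced here for illustration. The P-value line relies on the fact that a χ² distribution with 2 df has upper tail area exp(−x/2), so it is valid for this example only and not for general df.

```python
import math


def chi2_two_way(observed):
    """Return (statistic, df, expected) for a two-way table given as a list of rows.

    Hypothetical helper for illustration; follows the formulas
    E = (row total)(column total)/n and chi2 = sum of (O - E)^2 / E.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count for every cell under H0
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum the contributions (O - E)^2 / E over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Observed counts from the flu survey: rows Flu / No flu,
# columns No vaccine / One shot / Two shots.
flu_table = [
    [24, 9, 13],
    [289, 100, 565],
]

statistic, df, expected = chi2_two_way(flu_table)

# Assumption: closed form for the upper tail area, valid only when df == 2.
p_value = math.exp(-statistic / 2) if df == 2 else None

print(f"chi-square = {statistic:.3f}, df = {df}")
print(f"smallest expected count = {min(min(row) for row in expected):.2f}")
print(f"P-value below 0.0005: {p_value is not None and p_value < 0.0005}")
```

Printing the smallest expected count alongside the statistic makes it easy to check the large-sample assumption at the same time as the test result.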
The nature of the relationship still has to be explored. For example, estimate the following conditional probabilities:

P(flu = yes | vaccine = 0): estimate 24/313 = 0.0767
P(flu = yes | vaccine = 1): estimate 9/109 = 0.0826
P(flu = yes | vaccine = 2): estimate 13/578 = 0.0225

From this we find that the estimated probability of getting the flu for a person who had two shots is much smaller than the estimated probabilities for a person who had no shot or only one shot.

Example: Some time ago there was a report in the news that an AIDS vaccine tested in Thailand did not show any effect. The data quoted in the news are presented in the two-way table below (expected cell counts in parentheses):

          Placebo           Vaccine           Total
HIV+      105 (105.5)       106 (105.5)         211
HIV-      1168 (1167.5)     1167 (1167.5)      2335
Total     1273              1273               2546

χ² = 0.00237 + 0.00237 + 0.000214 + 0.000214 = 0.005168, with df = (2 − 1)(2 − 1) = 1.

Conduct a test of homogeneity:

1. Hypotheses: H0: The probability of becoming infected with HIV is the same for the vaccinated group and the placebo group, versus Ha: H0 is not true. α = 0.05.

2. Assumptions: The expected counts are all large enough.

3. Test statistic: See the calculation above: χ² = 0.0052, with 1 df.

4. P-value: From Table D we find P-value > 0.25.

5. Decision: Do not reject H0, since the P-value is greater than α.

6. Context: At significance level α = 0.05 the data do not provide sufficient evidence that the HIV infection rate was affected by the vaccine.

1.3 χ² Goodness-of-Fit Test

In this section the χ² test for comparing the relative frequency distribution from a sample with a given probability distribution is introduced.

Example: A company filling grass seed bags wants to evaluate its filling machine. The following distribution is advertised on the bags, where K1-K5 are different kinds of grass seeds:

kind of seeds   proportion
K1              0.50
K2              0.25
K3              0.15
K4              0.05
K5              0.05

The company wants to check whether the seed distribution in the bags fits the advertised distribution.
They take a sample of size 1000 and find the following summarized data:

kind of seeds   count
K1              480
K2              233
K3              160
K4               63
K5               64

In order to check whether the label is a truthful description of the contents of the seed bags, we want to compare the claimed distribution with the sample data.

Notation: for a given categorical random variable

k = number of categories
p1 = true probability to fall in category 1
p2 = true probability to fall in category 2
...
pk = true probability to fall in category k

In order to compare the observed frequencies with the hypothesized distribution we study the χ² goodness-of-fit statistic.

O1 = observed cell count for category 1
O2 = observed cell count for category 2
...
Ok = observed cell count for category k

These observed counts are to be compared with the expected counts, i.e. the counts we would expect if the claim (the label) is true. If we examine an event that occurs with probability p, then in a sample of size n we would expect to see this event about n · p times.
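Applied to the grass seed example, the rule "expect about n · p occurrences" gives one expected count per kind of seed. The following Python sketch tabulates observed against expected counts. The variable names advertised, observed and expected are chosen here for illustration and are not part of the notes.

```python
# Advertised proportions p1, ..., p5 from the label
advertised = {"K1": 0.50, "K2": 0.25, "K3": 0.15, "K4": 0.05, "K5": 0.05}

# Observed counts O1, ..., O5 from the sample
observed = {"K1": 480, "K2": 233, "K3": 160, "K4": 63, "K5": 64}

# Sample size n is the sum of all observed counts
n = sum(observed.values())

# Expected count for each category if the label is true: n * p
expected = {kind: n * p for kind, p in advertised.items()}

for kind in advertised:
    diff = observed[kind] - expected[kind]
    print(f"{kind}: observed {observed[kind]:4d}, "
          f"expected {expected[kind]:6.1f}, difference {diff:+6.1f}")
```

Because the advertised proportions add up to 1, the expected counts add up to the sample size, so the differences between observed and expected counts always sum to zero.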
