1 Analysis of Categorical Data

The techniques presented in previous sections cover the analysis of numerical data and variables. So far, the question of how to analyze categorical variables and data has been left unanswered.

1.1 Two-way tables

How should data for categorical variables be presented? Recall that when describing the data for a single categorical variable we chose a relative frequency table. When we have data on two categorical variables and want to illustrate how the two variables depend on each other, we use two-way tables.

Two-Way Tables: Data resulting from observations made on two categorical variables can be easily summarized in a two-way table.

Example: Suppose we are interested in the rate of sprouted seeds for three different kinds of water (rainwater, muddy water, tap water). The two categorical variables are sprouted (yes or no) and water type (rainwater, muddy water, or tap water). Suppose each kind of water was used to water 100 seeds, and afterwards it was noted how many of those seeds had sprouted. The result can be presented in the following table:

               sprouted
               yes    no   total
rainwater       64    36     100
muddy water     74    26     100
tap water       60    40     100
total          198   102     300

A natural question to ask at this point would be if the choice of water has an effect on the probability for a seed to sprout.

Multiple Comparison

The question in the example could be answered by comparing each type of water with each of the others, i.e. one would have to make the following comparisons:

muddy – rain
muddy – tap
rain – tap

But it would be better to first find out whether there is any difference at all between the probabilities. To do so we will introduce the χ² test for homogeneity.
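The number of pairwise comparisons grows quickly with the number of groups: k groups require k(k − 1)/2 comparisons. The following minimal sketch lists the pairs for the three water types of the example; the variable names are chosen only for this illustration.

```python
from itertools import combinations

# The three treatments from the sprouting example.
water_types = ["muddy", "rain", "tap"]

# Every unordered pair of treatments is one pairwise comparison.
pairs = list(combinations(water_types, 2))
for first, second in pairs:
    print(f"{first} - {second}")

# With k groups there are k(k - 1)/2 such pairs.
k = len(water_types)
num_pairs = k * (k - 1) // 2
print("number of comparisons:", num_pairs)
```

With five kinds of water the same code would already list ten comparisons, which is one reason to start with a single overall test.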

1.2 χ² Test of Homogeneity and Independence in a Two-Way Table

In this section two different situations and questions concerning two categorical variables are covered. Both lead to the same test. So let us first introduce the two types of questions and then the test routine that answers them.

1. Comparing the distribution of a categorical variable in two or more populations.

We will be looking at two or more samples from different populations and test whether the distributions in all populations are the same.

The null hypothesis to be tested in this kind of problem is that the distributions are homogeneous, i.e. they are equal for all populations.

H0 : The probabilities for sprouting and not sprouting are the same for all treatments (types of water)
Ha : The probabilities for sprouting and not sprouting depend on the treatment

2. Testing two categorical variables for independence

When you study data that involves two variables, one important consideration is the relationship between the two variables. Does the proportion of measurements in the various categories for factor 1 depend on which category of factor 2 is being observed?

Example: A study was conducted to evaluate the effectiveness of a new flu vaccine that had been administered in a small community. The vaccine consists of a two-shot sequence in two weeks. A survey of 1000 residents the following spring provided the following information:

          No vaccine   One shot   Two shots   Total
Flu               24          9          13      46
No flu           289        100         565     954
Total            313        109         578    1000

The question to be answered is whether the flu shot had an impact on the incidence of the flu. Or we could ask whether the incidence of the flu is independent of the vaccine. To answer this question a χ² test can be conducted, testing the following hypotheses:

H0 : No relationship between treatment and incidence of flu
Ha : Incidence of flu depends on amount of flu treatment

The two questions lead to exactly the same χ² test.

The χ² distribution

The χ² distribution is neither a normal nor a t-distribution. Table D gives upper tail areas for different degrees of freedom.
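Table D is not reproduced here, but the upper tail areas it lists can also be computed directly. The following is only a sketch under stated assumptions, not part of the original notes: the function name chi2_upper_tail is made up for this illustration, it handles whole-number degrees of freedom only, and it relies on the standard recurrence for the regularized upper incomplete gamma function, using nothing beyond Python's math module.

```python
import math

def chi2_upper_tail(x, df):
    """Return P(chi-square > x) for a positive integer df (hypothetical helper)."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    z = x / 2.0
    # Starting values: df = 2 gives exp(-z), df = 1 gives erfc(sqrt(z)).
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    # Each pass raises the degrees of freedom by 2 and adds one term.
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail

# A few (df, value) pairs to compare against a printed table.
for df, x in [(1, 3.84), (2, 5.99), (4, 9.49)]:
    print(f"df={df}, x={x}: upper tail area = {chi2_upper_tail(x, df):.4f}")
```

All three printed areas should come out close to 0.05, since these are the usual 5% critical values for 1, 2, and 4 degrees of freedom.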

The χ² Test for Homogeneity and Independence

Given are one categorical variable with R categories and one categorical variable with C categories.

Hypotheses: H0 : The two categorical variables are independent versus Ha : H0 is not true.

Assumption: The sample size is large. The sample size is considered large enough as long as every expected count is at least 1 and not more than 20% of the expected counts are less than 5.

Test statistic: Compute for every cell of the two-way table the expected frequency

\[ E = \text{expected frequency} = \frac{(\text{row total})(\text{column total})}{\text{sample size}} \]

and then

\[ \chi_0^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E} \]

where O is the observed count.

P-value: P(χ² > χ₀²), the upper tail area for a χ² distribution with (R − 1)(C − 1) df, found in Appendix Table D.

Decision: As usual.

Context: As usual.
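The steps of the test can be sketched in a few lines of code. This is an illustration under stated assumptions rather than a reference implementation: the names chi2_upper_tail and chi2_two_way are hypothetical, the P-value is computed instead of being looked up in Table D, and the input is the flu survey table given earlier (rows Flu and No flu, columns no vaccine, one shot, two shots).

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square > x) for a positive integer df (hypothetical helper)."""
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail

def chi2_two_way(observed):
    """Expected counts, test statistic, df and P-value for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E = (row total)(column total) / sample size, cell by cell.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of (O - E)^2 / E over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, statistic, df, chi2_upper_tail(statistic, df)

# Flu survey: rows = Flu, No flu; columns = no vaccine, one shot, two shots.
flu_table = [[24, 9, 13], [289, 100, 565]]
expected, statistic, df, p_value = chi2_two_way(flu_table)
print(f"chi-square = {statistic:.3f}, df = {df}, P-value = {p_value:.6f}")
```

Keeping the expected counts in the return value makes it easy to check the sample size assumption before trusting the P-value.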

Continue Flu Example:

1. Hypotheses: H0 : The flu is independent of the vaccine status versus Ha : H0 is not true. Let's perform this test at a significance level of α = 0.05.

2. Assumptions: These are easily met, since the expected counts are all greater than 5.

3. Test statistic: The two-way table gives the observed frequencies O.

Next calculate the expected cell count E for each cell:

Flu/no vaccine: (46·313)/1000 = 14.40
Flu/one shot: (46·109)/1000 = 5.01
Flu/two shots: (46·578)/1000 = 26.59
No flu/no vaccine: (954·313)/1000 = 298.60
No flu/one shot: (954·109)/1000 = 103.99
No flu/two shots: (954·578)/1000 = 551.41

Put the expected cell counts into the table, below the observed counts:

          No vaccine   One shot   Two shots   Total
Flu               24          9          13      46
               14.40       5.01       26.59
No flu           289        100         565     954
              298.60     103.99      551.41
Total            313        109         578    1000

Now find (O − E)²/E for each cell and add these fractions:

χ² = 6.404 + 3.169 + 6.944 + 0.309 + 0.153 + 0.335 = 17.313 and df = (3 − 1)(2 − 1) = 2

4. P-value: For df = 2, Table D gives P-value < 0.0005.

5. Decision: Since the P-value is less than α, reject the null hypothesis.

6. Context: At a significance level of 5% the data provide sufficient evidence that the probability of getting the flu is not the same for all three vaccination groups.

The nature of the relationship still has to be explored. For example, estimate the following probabilities:

P(flu = yes | vaccine = 0), estimate 24/313 = 0.0767
P(flu = yes | vaccine = 1), estimate 9/109 = 0.0826
P(flu = yes | vaccine = 2), estimate 13/578 = 0.0225

From this we find that the estimated probability of getting the flu given that a person had two shots is much lower than the estimated probabilities given that a person had no shot or only one shot.
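These estimates are simply the flu counts divided by the group totals of the survey table. A short sketch, with dictionary names chosen only for this illustration:

```python
# Counts taken from the flu survey table.
flu_counts = {"no vaccine": 24, "one shot": 9, "two shots": 13}
group_totals = {"no vaccine": 313, "one shot": 109, "two shots": 578}

# Estimated P(flu = yes | vaccination group) = flu count / group total.
flu_rates = {group: flu_counts[group] / group_totals[group] for group in flu_counts}

for group, rate in flu_rates.items():
    print(f"P(flu = yes | {group}) is estimated as {rate:.4f}")
```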

Example: Some time ago there was a report in the news that an AIDS vaccine tested in Thailand didn't show any effect. The data quoted in the news are presented in the two-way table below (expected cell counts are shown below the observed counts):

         Placebo   Vaccine   Total
HIV+         105       106     211
           105.5     105.5
HIV-        1168      1167    2335
          1167.5    1167.5
Total       1273      1273    2546

χ² = 0.00237 + 0.00237 + 0.000214 + 0.000214 = 0.005168 and df = (2 − 1)(2 − 1) = 1

Conduct a test of homogeneity:

1. Hypotheses: H0 : The probability of HIV infection is the same for the vaccinated and the placebo group versus Ha : H0 is not true. α = 0.05

2. Assumptions: The expected counts are all large enough.

3. Test statistic: See calculations above: χ² = 0.0052, with 1 df.

4. P-value: From Table D we find P-value > 0.25.

5. Decision: Do not reject H0 since the P-value is greater than α.

6. Context: At significance level α = 0.05 the data do not provide sufficient evidence that the HIV infection rate was affected by the vaccine.
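The same calculation can be checked numerically. The sketch below is an illustration only, with variable names chosen for this example. It uses the fact that a χ² variable with 1 df is the square of a standard normal variable, so the upper tail area can be written with the complementary error function as erfc(sqrt(x/2)); this shortcut applies to df = 1 only.

```python
import math

# Vaccine trial: rows = HIV+, HIV-; columns = placebo, vaccine.
observed = [[105, 106], [1168, 1167]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total)(column total) / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum of (O - E)^2 / E over the four cells.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Upper tail area for 1 degree of freedom (valid for df = 1 only).
p_value = math.erfc(math.sqrt(statistic / 2.0))

alpha = 0.05
print(f"chi-square = {statistic:.6f}, P-value = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```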

1.3 χ² Goodness–of–fit Test

In this section the χ² test for comparing the relative frequencies from a sample with a given distribution is introduced.

Example: A company filling grass seed bags wants to evaluate their filling machine. The following distribution is advertised on their bags, where K1-K5 are different kinds of grass seeds:

kind of seeds   proportion
K1              0.5
K2              0.25
K3              0.15
K4              0.05
K5              0.05

The company wants to check whether the seed distribution in the bags fits the advertised distribution. They take a sample of size 1000 and find the following summarized data:

kind of seeds   count
K1              480
K2              233
K3              160
K4              63
K5              64

In order to check if the label is a truthful description of the contents of the seed bags, we want to compare the claimed distribution with the sample data.

Notation: for a given categorical variable

k = number of categories
p1 = true probability of falling in category 1
p2 = true probability of falling in category 2
...
pk = true probability of falling in category k

In order to compare the observed frequencies with the hypothesized distribution we study the χ² goodness–of–fit statistic.

O1 = observed cell count for category 1
O2 = observed cell count for category 2
...
Ok = observed cell count for category k

These observed counts are compared with the counts we would expect if the claim is true (the label is true). If we examine an event that occurs with probability p, then in a sample of size n we would expect to see this event about n · p times. The χ² statistic is based on this fact.

For a given hypothesized population distribution p01, ..., p0k (in the example, this is the label) define:

E1 = expected cell count for category 1 = n · p01
E2 = expected cell count for category 2 = n · p02
...
Ek = expected cell count for category k = n · p0k

The χ² statistic is then:

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

If p01, ..., p0k is the true distribution, then the χ² statistic is χ²-distributed with df = k − 1 degrees of freedom.
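The definition translates almost directly into code. The following sketch applies it to the grass seed data from the example; the function name goodness_of_fit_statistic and the dictionary layout are assumptions made for this illustration.

```python
def goodness_of_fit_statistic(observed, hypothesized):
    """Expected counts, per-category terms and the chi-square statistic."""
    n = sum(observed.values())
    # E_i = n * p_0i for every category.
    expected = {kind: n * p for kind, p in hypothesized.items()}
    # (O_i - E_i)^2 / E_i for every category.
    terms = {
        kind: (observed[kind] - expected[kind]) ** 2 / expected[kind]
        for kind in observed
    }
    return expected, terms, sum(terms.values())

# Advertised proportions (the label) and the counts in the sample of 1000.
label = {"K1": 0.5, "K2": 0.25, "K3": 0.15, "K4": 0.05, "K5": 0.05}
counts = {"K1": 480, "K2": 233, "K3": 160, "K4": 63, "K5": 64}

expected, terms, statistic = goodness_of_fit_statistic(counts, label)
df = len(counts) - 1

for kind in counts:
    print(f"{kind}: O = {counts[kind]}, E = {expected[kind]:.0f}, term = {terms[kind]:.3f}")
print(f"chi-square = {statistic:.3f} with df = {df}")
```

Printing the individual terms shows which categories contribute most to the statistic; here these are the two rarest kinds, K4 and K5.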

Continue Example: With p01 = 0.5, p02 = 0.25, p03 = 0.15, p04 = 0.05, p05 = 0.05 we get:

kind of seeds   p0i    Oi    Ei = n·p0i   (Oi − Ei)²/Ei
K1              0.5    480   500          (480 − 500)²/500 = 0.8
K2              0.25   233   250          (233 − 250)²/250 = 1.156
K3              0.15   160   150          (160 − 150)²/150 = 0.667
K4              0.05   63    50           (63 − 50)²/50 = 3.38
K5              0.05   64    50           (64 − 50)²/50 = 3.92
                       n = 1000

And χ₀² = 0.8 + 1.156 + 0.667 + 3.38 + 3.92 = 9.923. This number now has to be evaluated as coming from a χ² distribution with 5 − 1 = 4 df.

The χ² Goodness–of–fit Test

Given a distribution p01, ..., p0k

Hypotheses: H0 : p1 = p01, ..., pk = p0k versus Ha : H0 is not true.

Assumption: The sample size is large. The sample size is large enough for the χ² test to be appropriate as long as every expected count is at least 5.

Test statistic:

\[ \chi_0^2 = \sum_{\text{all categories}} \frac{(O_i - E_i)^2}{E_i} \]

with df = k − 1.

P-value: P(χ² > χ₀²), the upper tail area for a χ² distribution with k − 1 df, found in Appendix Table D.

Decision: If the P-value ≤ α, then reject H0. If the P-value > α, then do not reject H0.

Continue Example:

1. Hypotheses: H0 : p1 = 0.5, p2 = 0.25, p3 = 0.15, p4 = 0.05, p5 = 0.05 versus Ha : H0 is not true.

2. Assumptions: The expected counts are all large enough (the smallest is 50).

3. Test statistic: The test statistic is χ₀² = 9.923 with df = 4 (see calculations above).

4. P-value: From Table D for df = 4 we find that 9.923 falls between 9.49 and 11.14, with upper tail probabilities 0.050 and 0.025. So we conclude 0.025 < P-value < 0.05.

5. Decision: Since the P-value ≤ 0.05 = α, we reject H0 at a significance level of α = 0.05.

6. Interpretation: At a significance level of 5% we find sufficient evidence in the sample that the label does not give a true description of the contents of the seed bag.
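Instead of bracketing the P-value with table entries, it can be computed directly. The sketch below is an illustration under stated assumptions: the helper name chi2_upper_tail_even_df is hypothetical, and it relies on the closed form of the χ² upper tail area that holds for even degrees of freedom only, which covers df = 4 in this example.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this shortcut only works for even df >= 2")
    z = x / 2.0
    return math.exp(-z) * sum(z ** j / math.factorial(j) for j in range(df // 2))

# Observed counts and the advertised proportions for K1 to K5.
observed = [480, 233, 160, 63, 64]
proportions = [0.5, 0.25, 0.15, 0.05, 0.05]

n = sum(observed)
expected = [n * p for p in proportions]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

p_value = chi2_upper_tail_even_df(statistic, df)
alpha = 0.05
reject_h0 = p_value <= alpha

print(f"chi-square = {statistic:.3f}, df = {df}, P-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The computed P-value should fall inside the interval read off from the table, so the decision at α = 0.05 is the same either way.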
