

5 Analysis of Categorical Data

People like to clump things into categories. Virtually every research project categorizes some of its observations into neat, little distinct bins: male or female; marital status; broken or not broken; small, medium, or large; race of patient; with or without a tonsillectomy; and so on. When we collect data by categories, we record counts—how many observations fall into a particular bin. Categorical variables are usually classified as being of two basic types: nominal and ordinal. Nominal variables involve categories that have no particular order, such as hair color, race, or clinic site, while the categories associated with an ordinal variable have some inherent ordering (categories of socioeconomic status, etc.). Unless otherwise stated, the procedures discussed here can be used on any type of categorical data. There are some specific procedures for ordinal data, and they will be briefly discussed later in the chapter. Statisticians have devised a number of ways to analyze and explain categorical data. This chapter presents explanations of each of the following methods:

• A contingency table analysis is used to examine the relationship between two categorical variables.
• McNemar’s test is designed for the analysis of paired dichotomous, categorical variables to detect disagreement or change.
• The Mantel-Haenszel test is used to determine whether there is a relationship between two dichotomous variables controlling for or within levels of a third variable.



• Interrater reliability (kappa) tests whether two raters looking at the same occurrence (or condition) give consistent ratings.
• A goodness-of-fit test measures whether an observed group of counts matches a theoretical pattern.
• A number of other categorical data measures are also briefly discussed.

To get the most out of this chapter, you should first verify that your variables are categorical and then try to match the hypotheses you are testing with the ones described in this chapter. If it is not clear that the hypotheses you are testing match any of these given here, we recommend that you consult a statistician.

Contingency Table Analysis (r × c)

Contingency table analysis is a common method of analyzing the association between two categorical variables. Consider a categorical variable that has r possible response categories and another categorical variable with c possible categories. In this case, there are r × c possible combinations of responses for these two variables. The r × c crosstabulation or contingency table has r rows and c columns consisting of r × c cells containing the observed counts (frequencies) for each of the r × c combinations. This type of analysis is called a contingency table analysis and is usually accomplished using a chi-square test that compares the observed counts with those that would be expected if there were no association between the two variables.
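Although the examples in this chapter use SPSS, the expected counts behind the chi-square statistic are easy to compute directly. The following minimal Python sketch (the language and the small 2 × 3 table are illustrative, not from the text) computes the expected cell counts under no association and the chi-square statistic built from them.

import numpy as np

# Hypothetical 2 x 3 table of observed counts (rows and columns are two categorical variables)
observed = np.array([[20, 30, 50],
                     [30, 20, 50]])

row_totals = observed.sum(axis=1, keepdims=True)   # r row totals
col_totals = observed.sum(axis=0, keepdims=True)   # c column totals
n = observed.sum()

# Expected count for cell (i, j) under no association: (row i total) * (column j total) / n
expected = row_totals @ col_totals / n
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi_square)   # scipy.stats.chi2_contingency(observed) reproduces this plus the df and p-value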

Appropriate Applications of Contingency Table Analysis

The following are examples of situations in which a chi-square contingency table analysis would be appropriate.

• A study compares types of crime and whether the criminal is a drinker or abstainer.
• An analysis is undertaken to determine whether there is a gender preference between candidates running for state governor.
• Reviewers want to know whether worker dropout rates are different for participants in two different job-training programs.
• A marketing research company wants to know whether there is a difference in response rates among small, medium, and large companies that were sent a questionnaire.


Design Considerations for a Contingency Table Analysis

Two Strategies

Two separate sampling strategies lead to the chi-square contingency table analysis discussed here.

1. Test of Independence. A single random sample of observations is selected from the population of interest, and the data are categorized on the basis of the two variables of interest. For example, in the marketing research example above, this sampling strategy would indicate that a single random sample of companies is selected, and each selected company is categorized by size (small, medium, or large) and whether that company returned the questionnaire.

2. Test for Homogeneity. Separate random samples are taken from each of two or more populations to determine whether the responses related to a single categorical variable are consistent across populations. In the marketing research example above, this sampling strategy would consider there to be three populations of companies (based on size), and you would select a sample from each of these populations. You then test to determine whether the response rates differ among the three company types.

The two-way table is set up the same way regardless of the sampling strategy, and the chi-square test is conducted in exactly the same way. The only real difference in the analysis is in the statement of the hypotheses and conclusions.

Expected Cell Size Considerations

The validity of the chi-square test depends on both the sample size and the number of cells. Several rules of thumb have been suggested to indicate whether the chi-square approximation is satisfactory. One such rule suggested by Cochran (1954) says that the approximation is adequate if no expected cell frequencies are less than one and no more than 20% are less than five.
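Once the expected counts are available, Cochran's rule of thumb can be checked programmatically. The short Python helper below is a sketch; the function name and the example counts are ours, not part of the text.

import numpy as np

def cochran_rule_ok(expected):
    """Return True if no expected count is below 1 and at most 20% are below 5."""
    expected = np.asarray(expected, dtype=float)
    return expected.min() >= 1 and np.mean(expected < 5) <= 0.20

print(cochran_rule_ok([[8.5, 11.5], [8.5, 11.5]]))   # True for these hypothetical expected counts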

Combining Categories

Because of the expected cell criterion in the second sampling strategy, it may be necessary to combine similar categories to lessen the number of categories in your table or to examine the data by subcategories. See the section on Mantel-Haenszel comparisons later in this chapter for information on one way to examine information within categories.


Hypotheses for a Contingency Table Analysis

The statement of the hypotheses depends on whether you used a test of independence or a test for homogeneity.

Test of Independence

In this case, you have two variables and are interested in testing whether there is an association between the two variables. Specifically, the hypotheses to be tested are the following:

H0: There is no association between the two variables.

Ha: The two variables are associated.

Test for Homogeneity

In this setting, you have a categorical variable collected separately from two or more populations. The hypotheses are as follows:

H0: The distribution of the categorical variable is the same across the populations.

Ha: The distribution of the categorical variable differs across the populations.

Tips and Caveats for a Contingency Table Analysis

Use Counts—Do Not Use Percentages

It may be tempting to use percentages in the table and calculate the chi-square test from these percentages instead of the raw observed frequencies. This is incorrect—don’t do it!

No One-Sided Tests

Notice that the alternative hypotheses above do not assume any “direction.” Thus, there are no one- and two-sided versions of these tests. Chi-square tests are inherently nondirectional (“sort of two-sided”) in the sense that the chi-square test is simply testing whether the observed frequencies and expected frequencies agree without regard to whether particular observed frequencies are above or below the corresponding expected frequencies.

Each Subject Is Counted Only Once

If you have n total observations (i.e., the total of the counts is n), then these n observations should be independent. For example, suppose you have a categorical variable Travel in which subjects are asked by what means they commute to work. It would not be correct to allow a subject to check multiple responses (e.g., car and commuter train) and then include all of these responses for this subject in the table (i.e., count the subject more than once). On such a variable, it is usually better to allow only one response per subject. If you want to allow for multiple responses such as this, then as you are tallying your results, you would need to come up with a new category, “car and commuter train.” This procedure can lead to a large number of cells and small expected cell frequencies.

Explain Significant Findings

Unlike many other tests, the simple finding of a significant result in a contingency table analysis does not explain why the results are significant. It is important for you to examine the observed and expected frequencies and explain your findings in terms of which of the differences between observed and expected counts are the most striking.

Contingency Table Examples

The following two examples of contingency table analysis illustrate a variety of the issues involved in this type of analysis.

EXAMPLE 5.1: r × c Contingency Table Analysis

Describing the Problem

In 1909, Pearson conducted a now classic study involving the relationship between criminal behavior and the drinking of alcoholic beverages. He studied 1,426 criminals, and the data in Table 5.1 show the drinking patterns in various crime categories. (The term coining in the table is a term for counterfeiting that is no longer in common usage.) This table is made up of counts in 6 × 2 cells, and, for example, 300 subjects studied were abstainers who had been convicted of stealing. The hypotheses of interest are as follows:

H0: There is no association between type of crime and drinking status.

Ha: There is an association between type of crime and drinking status.

In Table 5.2, we show output for these data where we see not only the cell frequencies shown in Table 5.1 but also the expected cell frequencies and the row percentages. That is, for each cell, the table gives the percentage of the row total that this cell count represents. For example, there were


Table 5.1 Pearson’s Crime Analysis Data

Crime Drinker Abstainer Total

Arson 50 43 93
Rape 88 62 150
Violence 155 110 265
Stealing 379 300 679
Coining 18 14 32
Fraud 63 144 207
Total 753 673 1426

93 arsonists, and 50 of these said they were drinkers. The row percentage in this case tells us that 50 is 53.8% of 93. From the bottom of the table, it can be seen that slightly more than half (i.e., 753 out of 1,426 or 52.8%) of the subjects were drinkers. If this pattern turns out to be consistent among all crime categories (i.e., if about 52.8% were drinkers in each crime category), then this would be evidence against an association between type of crime and drinking status, and we would thus not expect to reject the null hypothesis. However, examination of the table reveals the interesting result that while every other crime category has a few more drinkers than abstainers, the crime category of “Fraud” shows a strong preponderance of abstainers. Note that if there were no association between the two variables, then we would expect, for example, that about 52.8% of the 93 arson subjects (i.e., 49.1) would be drinkers and that about 47.2% of these 93 subjects (i.e., 43.9) would be abstainers. In the case of arson, for example, the observed frequencies are quite similar to these “expected” frequencies. In contrast, for the criminals involved in fraud, the observed frequencies are very different from the expected frequencies. Notice that while the expected count for abstainers involved in fraud is 97.7, the observed count is 144—clearly a large and unanticipated difference if there is no association between crime and drinking status. In Table 5.3, we show the statistical results related to the analysis of these data. The value of the chi-square statistic is 49.731, with 5 degrees of freedom and p = 0.000 (this would be reported as p < 0.001), and thus we reject the null hypothesis of no association and conclude that there is a relationship between crime and drinking status.


Table 5.2 Output for Crime Data

Crosstab

Drinker
Drinker Abstainer Total
Crime Arson Count 50 43 93
Expected Count 49.1 43.9 93.0
% within Crime 53.8% 46.2% 100.0%
Rape Count 88 62 150
Expected Count 79.2 70.8 150.0
% within Crime 58.7% 41.3% 100.0%
Violence Count 155 110 265
Expected Count 139.9 125.1 265.0
% within Crime 58.5% 41.5% 100.0%
Stealing Count 379 300 679
Expected Count 358.5 320.5 679.0
% within Crime 55.8% 44.2% 100.0%
Coining Count 18 14 32
Expected Count 16.9 15.1 32.0
% within Crime 56.3% 43.8% 100.0%
Fraud Count 63 144 207
Expected Count 109.3 97.7 207.0
% within Crime 30.4% 69.6% 100.0%
Total Count 753 673 1426
Expected Count 753.0 673.0 1426.0
% within Crime 52.8% 47.2% 100.0%

As previously mentioned, this relationship is primarily due to the unusual fact that about 70% of the criminals convicted of fraud were abstainers. (One wonders if there was some “fraud” involved in these criminals’ answers to this question.) It can also be seen that the expected frequencies are relatively close to the observed frequencies for each cell except in the cells for fraud. The bar chart in Figure 5.1 provides a visual confirmation that the pattern for fraud crimes is different from that of the other crimes. All three pieces of information lead you to conclude that the crime of fraud is more common to abstainers and that the other crimes are more common to drinkers. Before proceeding, it should be noted that the likelihood ratio statistic given in Table 5.3 is an alternative to the Pearson chi-square. While these two test statistics usually give similar results (as they did in this example), most practitioners prefer Pearson’s chi-square.



Figure 5.1 Bar Chart for Crime Versus Drinking Analysis

Table 5.3 Chi-Square Tests for Crime Data

Chi-Square Tests

Value df Asymp. Sig. (2-sided)
Pearson Chi-Square 49.731a 5 .000
Likelihood Ratio 50.517 5 .000
N of Valid Cases 1426

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 15.10.
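For readers who want to reproduce Table 5.3 outside of SPSS, the following Python sketch applies scipy's chi-square test of independence to the counts in Table 5.1 (scipy is our illustrative choice here; the book itself uses SPSS).

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 5.1: rows = crime category, columns = (Drinker, Abstainer)
crime = np.array([[ 50,  43],    # Arson
                  [ 88,  62],    # Rape
                  [155, 110],    # Violence
                  [379, 300],    # Stealing
                  [ 18,  14],    # Coining
                  [ 63, 144]])   # Fraud

chi2, p, dof, expected = chi2_contingency(crime)
print(round(chi2, 3), dof, p)    # about 49.731 with 5 df and p < 0.001, as in Table 5.3
print(np.round(expected, 1))     # expected counts, e.g., 49.1 and 43.9 for arson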

Reporting the Results of a Contingency Table Analysis

The following illustrates how you might report this chi-square test in a publication format.


Narrative for the Methods Section

“A chi-square test was performed to test the null hypothesis of no association between type of crime and incidence of drinking.”

Narrative for the Results Section

“An association between drinking preference and type of crime committed was found, χ2 (5, N = 1,426) = 49.7, p < 0.001.”

Or, to be more complete,

“An association between drinking preference and type of crime committed was found, χ2 (5, N = 1,426) = 49.7, p < 0.001. Examination of the cell frequencies showed that about 70% (144 out of 207) of the criminals convicted of fraud were abstainers while the percentage of abstainers in all of the other crime categories was less than 50%.”

SPSS Step-by-Step. EXAMPLE 5.1: r × c Contingency Table Analysis

While most data sets in SPSS are stored casewise, you can store summarized count data such as that shown in Table 5.1. These data are available (in count form) in the file CRIME.SAV. To create this data set, follow these steps:

1. Select File/New/Data. . . .

2. This new data set will have three numeric variables: Crime, Drinker, and Count. In the first column (i.e., for the Crime variable), enter the numbers 1 for arson, 2 for rape, 3 for violence, 4 for stealing, 5 for coining, and 6 for fraud. These labels are specified using value labels (see Appendix A: A Brief Tutorial for Using SPSS for Windows if you do not know how to do this). The second column corresponds to the variable Drinker, where 1 indicates drinker and 2 indicates abstainer. The third column corresponds to the variable Count. For example, the observation 43 for the variable Count in the second row of the table is the number of subjects who were arsonists (Crime = 1) and abstainers (Drinker = 2).

3. Select Data/Weight Cases . . . and select the “weight cases by” option with Count as the Frequency variable.

The contents of CRIME.SAV are shown in Figure 5.2.


Figure 5.2 CRIME.SAV Count Form Data Set

Once you have specified this information, you are ready to perform the analysis. Follow these steps:

4. Select Analyze/Descriptive Statistics/Crosstabs.

5. Choose Crime as the row variable and Drinker as the column variable.

6. Select the checkbox labeled “Display Clustered Bar Charts.”

7. Click the Statistics button and select Chi-Square and Continue.

8. Click the Cells button and select Expected in the Counts box and select Row in the Percentages section and Continue.

9. Click OK, and SPSS creates the output shown in Tables 5.2 and 5.3 along with Figure 5.1.

Program Comments

In this example, we instructed SPSS to print out the table showing the observed and expected frequencies under the Cells option. We can also ask for the row, column, and total percentages. For example, if total percentages were selected, then the percentage that each observed frequency is of the total sample size would be printed in each cell.

EXAMPLE 5.2: 2 × 2 Contingency Table Analysis

A number of studies involve binary outcomes (i.e., 1 and 0, yes and no). Typically, these occur when you are observing the presence or absence of a characteristic such as a disease, flaw, mechanical breakdown, death, failure, and so on. The analysis of the relationship between two dichotomous categorical variables results in a 2 × 2 crosstabulation table of counts. Although the 2 × 2 table is simply a special case of the general r × c table, the SPSS output for 2 × 2 tables is more extensive. Consider an experiment in which the relationship between exposure to a particular reagent (a substance present in a commercial floor cleanser) and the occurrence of a type of reaction (mild skin rash) was studied. Two groups of subjects were studied: One group of 20 patients was exposed to the reagent, and the other group was not. The 40 subjects were examined for the presence of the reaction. The summarized data are shown in Table 5.4.

Table 5.4 Exposure/Reaction Data

Reaction No Reaction Total
Exposed 13 7 20
Not Exposed 4 16 20
Total 17 23 40

Table 5.5 shows computer output where the observed and expected frequencies are shown along with row percentages, and Table 5.6 shows typical statistical output for a 2 × 2 table. These results report a Pearson chi-square of 8.286 with 1 degree of freedom and p = 0.004. It should be noted that in the 2 × 2 setting, use of the rule of thumb that no more than 20% of the expected values are less than 5 requires that none of the four expected values should be less than 5. Footnote b reports this fact for these data. The continuity correction statistic (Yates’s correction) is an adjustment of the chi-square test for 2 × 2 tables used by some statisticians to improve the


Table 5.5 Output for 2 × 2 Exposure/Reaction Data

Exposure * Reaction Crosstabulation
Reaction
Reaction No Reaction Total
Exposure Exposed Count 13 7 20
Expected Count 8.5 11.5 20.0
% within Exposure 65.0% 35.0% 100.0%
Not Exposed Count 4 16 20
Expected Count 8.5 11.5 20.0
% within Exposure 20.0% 80.0% 100.0%
Total Count 17 23 40
Expected Count 17.0 23.0 40.0
% within Exposure 42.5% 57.5% 100.0%

Table 5.6 Statistical Output for 2 × 2 Exposure/Reaction Data

Chi-Square Tests
Value df Asymp. Sig. (2-sided) Exact Sig. (2-sided) Exact Sig. (1-sided)
Pearson Chi-Square 8.286b 1 .004
Continuity Correctiona 6.547 1 .011
Likelihood Ratio 8.634 1 .003
Fisher’s Exact Test .010 .005
Linear-by-Linear Association 8.079 1 .004
N of Valid Cases 40
a. Computed only for a 2 × 2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 8.50.

chi-square approximation. This correction has a more dramatic effect when expected values are relatively small. The continuity correction reduces the size of the chi-square value and thus increases the p-value. In this case, the corrected chi-square value is 6.547 with p = 0.011. When there are small expected values in the 2 × 2 table, many statisticians recommend reporting the results of Fisher’s exact test. This test is based on all possible 2 × 2 tables that have the observed marginal frequencies (i.e., 20 in each of the exposed and nonexposed groups, with 17 having a reaction and 23 not experiencing the reaction). The probabilities of obtaining each of the possible tables when the null hypothesis is true are obtained, and the p-value is the sum of these probabilities


from all tables that are as rare as, or rarer than, the one observed. For these data, the resulting p-value for Fisher’s two-sided test is p = 0.010. There is no actual test statistic to quote when using Fisher’s exact test, and there is a one-sided version if that is desired. All of the results are consistent in leading to rejection of the null hypothesis at the α = 0.05 level and thus to the conclusion that there is a relationship between the reagent and reaction. In this case, the exposed subjects had a significantly higher rate of response to the reagent (65% or 13 of 20) than did nonexposed subjects (20% or 4 of 20).
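As an illustration outside of SPSS, the same three results (Pearson chi-square, the Yates-corrected chi-square, and Fisher's exact test) can be reproduced with scipy; the sketch below uses the counts from Table 5.4, and scipy is our choice rather than the book's.

from scipy.stats import chi2_contingency, fisher_exact

table = [[13, 7],    # Exposed:     13 reactions,  7 no reaction
         [4, 16]]    # Not exposed:  4 reactions, 16 no reaction (Table 5.4)

chi2, p, _, _ = chi2_contingency(table, correction=False)              # Pearson chi-square
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)   # Yates continuity correction
odds_ratio, p_fisher = fisher_exact(table, alternative="two-sided")    # Fisher's exact test

print(round(chi2, 3), round(p, 3))               # about 8.286 and 0.004
print(round(chi2_yates, 3), round(p_yates, 3))   # about 6.547 and 0.011
print(round(p_fisher, 3))                        # about 0.010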


Reporting the Results of a 2 × 2 Contingency Table Analysis

The following illustrates how you might report this chi-square test in a publication format.

Narrative for the Methods Section

“A chi-square test was performed to test the hypothesis of no association between exposure and reaction.”

Narrative for the Results Section

“A higher proportion of the exposed group showed a reaction to the reagent, χ2 (1, N = 40) = 8.29, p = 0.004.”

Or, to be more complete,

“A higher proportion of the exposed group (65% or 13 of 20) showed a reaction to the reagent than did the nonexposed group (20% or 4 of 20), χ2 (1, N = 40) = 8.29, p = 0.004.”

SPSS Step-by-Step. EXAMPLE 5.2: 2 × 2 Contingency Table Analysis

The data in this example are in file EXPOSURE22.SAV as 40 casewise observations on the two variables, Exposure (0 = exposed, 1 = not exposed) and Reaction (0 = reaction, 1 = no reaction). This data file was created using Exposure and Reaction as numeric variables with the labels given above. A portion of the data is shown in Figure 5.3.


Figure 5.3 First 14 Cases of EXPOSURE22.SAV

To obtain the chi-square results in Table 5.6, open the data set EXPOSURE22.SAV and do the following:

1. Select Analyze/Descriptive Statistics/Crosstabs.

2. Choose Exposure as the row variable and Reaction as the column variable.

3. Click the Statistics button and select Chi Square and Continue.

4. Click the Cells button and select Expected in the Counts box and select Row in the Percentages section and Continue.

5. Click OK.

Analyzing Risk Ratios in a 2 × 2 Table

Another way to analyze a 2 × 2 table is by examining measures of risk. In a medical setting, for example, a 2 × 2 table is often constructed where one variable represents exposure (to some risk factor) and the other represents an outcome (presence or absence of disease). In this case, the researcher is interested in calculating measures of risk. The odds ratio (OR) is used as a measure of risk in a retrospective (case control) study. A case control study is one in which the researcher takes a sample of subjects based on their outcome and looks back in history to see whether they had been exposed. If the study is a prospective (cohort) study, where subjects are selected by presence or absence of exposure and observed over time to see if they come down with the disease, the appropriate measure of risk is the relative risk (RR). A standard format for risk analysis data is given in Table 5.7. Note that subjects with the outcome of interest are counted in the first column, and those exposed to the risk factor are given in the first row so that “a” represents the number of subjects who were exposed to the risk factor and had the outcome of interest.

Table 5.7 Standard Risk Analysis Table

Outcome
Risk Factor Present Absent Total
Exposed a b a + b
Not Exposed c d c + d
Total a + c b + d N = a + b + c + d

For a retrospective study, the appropriate measure of risk is the odds ratio. In this case, the observed odds of having the outcome when exposed is a/b, while the corresponding odds when not exposed is c/d. Thus, the odds ratio is estimated to be (a/b)/(c/d) = ad/bc. So, for example, if the odds ratio is 3, then this means that the odds of having the outcome of interest are three times larger in the exposed group than in the nonexposed group. Similarly, in a prospective study, we use the relative risk. In this case, it makes sense to say that the estimated risk of having the outcome of interest is a/(a + b) for the exposed group and c/(c + d) for the nonexposed group. Thus, the estimated relative risk is [a/(a + b)] / [c/(c + d)], and if, for example, the relative risk is 3, this indicates that the observed risk of having the outcome of interest is three times greater in the exposed group than in the nonexposed group.
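The two formulas translate directly into code. The minimal Python sketch below (the function names are ours) computes both measures from the cell counts a, b, c, and d laid out as in Table 5.7; with the exposure/reaction counts from Table 5.4 (a = 13, b = 7, c = 4, d = 16) it returns approximately 7.43 and 3.25.

def odds_ratio(a, b, c, d):
    """Odds ratio for a retrospective (case control) study: (a/b) / (c/d) = ad/bc."""
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    """Relative risk for a prospective (cohort) study: [a/(a+b)] / [c/(c+d)]."""
    return (a / (a + b)) / (c / (c + d))

print(odds_ratio(13, 7, 4, 16))      # about 7.43
print(relative_risk(13, 7, 4, 16))   # 3.25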

When the outcome is rare, the values of the odds ratio and relative risk are approximately equal.

In many retrospective analyses, we are interested in discovering whether the odds ratio is (statistically) different from 1. If the OR is shown to be greater than 1, for example, it provides evidence that a risk (measured by the size of the OR) exists. An OR that is not different from 1 provides no evidence that a risk exists. If the OR is significantly less than 1, it implies that exposure to the factor provides a benefit.

Appropriate Applications for Retrospective (Case Control) Studies

The following are examples of retrospective (case control) studies in which the odds ratio is the appropriate measure of risk.

• In order to investigate smoking and lung cancer, a group of patients who have lung cancer (cases) are compared to a control group without lung cancer. Each of the subjects in these two groups is then classified as being a smoker (i.e., the exposure) or a nonsmoker.
• In order to assess whether having received the measles, mumps, and rubella vaccine increases the risk of developing autism, a group of autistic children and a control group of nonautistic children are compared on the basis of whether they have previously received the vaccine.

Appropriate Applications for Prospective (Cohort) Studies

• A group of football players with a history of concussion is compared over a 2-year period with a control group of football players with no history of concussion to see whether those with such history are more likely to experience another concussion during that period.
• Samples of 500 subjects with high cholesterol and 500 subjects without high cholesterol are followed for the next 10 years to determine whether subjects in the high-cholesterol group were more likely to have a heart attack during that period.

In the following example, we reexamine the data analyzed in EXAMPLE 5.2 from the perspective of a risk analysis.

EXAMPLE 5.3: Analyzing Risk Ratios for the Exposure/Reaction Data

Consider again the data in Tables 5.4 and 5.5 related to the exposure to a reagent and the occurrence of a particular reaction. Risk analysis output is shown in Table 5.8.


Table 5.8 Risk Analysis Results

Risk Estimate
95% Confidence Interval
Value Lower Upper
Odds Ratio for Exposure (Exposed/Not Exposed) 7.429 1.778 31.040
For cohort Reaction = Reaction 3.250 1.278 8.267
For cohort Reaction = No Reaction .438 .232 .827
N of Valid Cases 40

In the previous discussion of these data, we intentionally did not specify whether the data were collected for purposes of a prospective or retrospective study. In the following, we discuss both types of analyses.

Analysis as a Retrospective (Case Control) Study

We assume here that 17 subjects experiencing the reaction were selected for study along with 23 subjects who did not experience such a reaction (see Table 5.4). We then look back to determine which of these subjects experienced the particular exposure in question. The calculated odds ratio (odds ratio for exposure) of 7.429 estimates that the odds of having the reaction when exposed is 7.429 times greater than the odds when not exposed. Since the 95% confidence interval of the true odds ratio—that is, (1.778, 31.040)—stays greater than 1, you can interpret this result as being statistically significant (i.e., you can conclude that the odds of having the reaction when exposed is higher than the odds when not exposed).

Analysis as a Prospective (Cohort) Study

In this setting, we assume that 20 subjects who experienced the particular exposure and 20 who did not were chosen for the study. These subjects were then followed over time to see whether they had the reaction. In this case, the appropriate measure of risk is relative risk, and the value of 3.25, labeled as “For cohort Reaction = Reaction,” indicates that the observed risk of having the reaction is 3.25 times greater for the exposed group than for the nonexposed group. The corresponding 95% confidence interval on the true relative risk is (1.278, 8.267), which again stays greater than 1 and thus indicates a significant result.
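The confidence intervals in Table 5.8 can be reproduced with the usual large-sample formulas on the log scale. The Python sketch below is our own illustration (not part of the SPSS output); it assumes SE(ln OR) = sqrt(1/a + 1/b + 1/c + 1/d) and SE(ln RR) = sqrt(b/[a(a+b)] + d/[c(c+d)]).

import math

a, b, c, d = 13, 7, 4, 16   # exposure/reaction counts from Table 5.4

# Odds ratio and its 95% CI on the log scale
or_hat = (a * d) / (b * c)
se_ln_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
or_ci = (math.exp(math.log(or_hat) - 1.96 * se_ln_or),
         math.exp(math.log(or_hat) + 1.96 * se_ln_or))

# Relative risk and its 95% CI on the log scale
rr_hat = (a / (a + b)) / (c / (c + d))
se_ln_rr = math.sqrt(b / (a * (a + b)) + d / (c * (c + d)))
rr_ci = (math.exp(math.log(rr_hat) - 1.96 * se_ln_rr),
         math.exp(math.log(rr_hat) + 1.96 * se_ln_rr))

print(or_hat, or_ci)   # about 7.43 with CI (1.78, 31.0), matching Table 5.8
print(rr_hat, rr_ci)   # 3.25 with CI (1.28, 8.27)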

A savvy researcher will use the confidence interval not only to assess significance but also to determine what practical meaning this risk would entail if the true value were at the lower or upper end of the interval. Also, it is helpful to interpret confidence intervals in context with other published studies.

Reporting the Results of a Risk Analysis

The following illustrates how you might report risk analysis results in a publication format.

Narrative for the Methods Section of a Retrospective (Case Control) Study

“To examine the relative risk of having the reaction when exposed, we calculated an odds ratio.”

Narrative for the Results Section of a Retrospective (Case Control) Study

“The odds of having the reaction were 7.43 times greater for subjects in the exposed group than for subjects not exposed to the reagent (OR = 7.4, 95% CI = 1.8, 31.0). Thus, the odds ratio is significantly greater than 1, suggesting that the true odds of having the reaction are greater for the exposed group.”

Narrative for the Methods Section of a Prospective (Cohort) Study

“To examine the relative risk of having the reaction when exposed, we calculated relative risk.”

Narrative for the Results Section of a Prospective (Cohort) Study

“Subjects exposed to the reagent were 3.25 times more likely to have a reaction as measured by relative risk (RR = 3.25, 95% CI = 1.3, 8.3). Thus, the relative risk is significantly greater than 1, indicating that the risk of having the reaction is greater for the exposed group.”

SPSS Step-by-Step. EXAMPLE 5.3: Analyzing Risk Ratios for the Exposure/Reaction Data

In this example, we use the exposure data in EXPOSURE22.SAV. To obtain the risk analysis results, open the data set EXPOSURE22.SAV and do the following:


1. Select Analyze/Descriptive Statistics/Crosstabs.

2. Choose Exposure as the row variable and Reaction as the column variable.

3. Click the Statistics button and select the Risk checkbox and Continue.

4. Click OK.

The resulting output is displayed in Table 5.8.

Program Comments

For SPSS to calculate the correct values for the odds ratio, you will need to set up your SPSS data set in a specific way so that the resulting 2 × 2 output table appears in the standard display format shown in Table 5.7. In this table, notice that subjects at risk are in the first row and subjects not at risk are in the second row of the table. Similarly, those who experience the outcome of interest are placed in column 1. For SPSS to produce this table, you must designate row and column positions alphabetically or in numeric order. That is, in the current example, the risk factor is coded so that 0 = exposed and 1 = not exposed. You could also code them as 1 = exposed and 2 = not exposed. Both of these coding strategies put the exposed subjects in the first row of the table. If, for example, you used the opposite strategy where 0 = not exposed and 1 = exposed, that would place those having the exposure in the second row. This would cause the OR to be calculated as 0.135 (which is the inverse of the actual value OR = 7.429). This coding conundrum is unfortunate since most people intuitively code their data with 1 meaning exposure and 0 meaning no exposure. However, to make the results come out correctly in SPSS, you should use the guidelines listed above.

McNemar’s Test

In the section on contingency table analysis, we saw that a test for independence is used to test whether there is a relationship between dichotomous categorical variables such as political party preference (Republican or Democrat) and voting intention (plan to vote or do not plan to vote). Also, in EXAMPLE 5.2, researchers were interested in determining whether there was a relationship between exposure to a risk factor and occurrence of a reaction. While in these cases, it is of interest to determine whether the variables are independent, in some cases, the categorical variables are paired in such a way that a test for independence is meaningless. McNemar’s test is designed for the analysis of paired dichotomous, categorical variables in much the same way that the paired t-test is designed for paired quantitative data.


Appropriate Applications of McNemar’s Test

The following are examples of dichotomous categorical data that are paired and for which McNemar’s test is appropriate.

• An electrocardiogram (ECG) and a new scanning procedure are used to detect whether heart disease is present in a sample of patients with known heart disease. Researchers want to analyze the disagreement between the two measures.
• A sample of possible voters is polled to determine how their preferences for the two major political party candidates for president changed before and after a televised debate.
• Consumers are selected for a study designed to determine how their impressions of a particular product (favorable or unfavorable) changed before and after viewing an advertisement.

In each case above, the pairing occurs because the two variables represent two readings of the same characteristic (e.g., detection of heart disease, presidential preference, etc.). You already know that the readings will agree to some extent, and thus testing for independence using a contingency table approach is not really appropriate. What you want to measure is disagreement or change. That is, in what way do the two heart scan procedures differ, how has the debate or the advertisement changed the opinions of the subjects, and so forth?

Hypotheses for McNemar’s Test

In situations for which McNemar’s test is appropriate, the interest focuses on the subjects for which change occurred. Following up on the third bulleted example above, suppose an advertiser wants to know whether an advertisement has an effect on the impression consumers have of a product. A group of people is selected, and their feelings about the product before and after viewing the advertisement are recorded as favorable or unfavorable. Thus, there are four categories of “before versus after” responses:

(i) Favorable both before and after viewing the advertisement

(ii) Favorable before and unfavorable after viewing the advertisement

(iii) Unfavorable before and favorable after viewing the advertisement

(iv) Unfavorable both before and after viewing the advertisement

We are interested in determining whether the advertisement changed attitudes toward the product. That is, we concentrate on the subjects in categories


(ii) and (iii). Was it more common for a subject to be in category (ii) than in category (iii) or vice versa? The hypotheses of interest would be the following:

H0: The probability of a subject having a favorable response to the product before viewing the advertisement and an unfavorable response afterward is equal to the probability of having an unfavorable response to the product before viewing the advertisement and a favorable response afterward.

Ha: The probability of a subject having a favorable response to the product before viewing the advertisement and an unfavorable response afterward is not equal to the probability of having an unfavorable response to the prod- uct before viewing the advertisement and a favorable response afterward.

Clearly, the goal of the advertisement would be to improve people’s attitudes toward the product; that is, there should be more people in category (iii) than category (ii). The corresponding one-sided hypotheses reflecting this goal are as follows:

H0: The probability of a subject having a favorable response to the product before viewing the advertisement and an unfavorable response afterward is equal to the probability of having an unfavorable response to the product before viewing the advertisement and a favorable response afterward.

Ha: The probability of a subject having a favorable response to the product before viewing the advertisement and an unfavorable response afterward is less than the probability of having an unfavorable response to the product before viewing the advertisement and a favorable response afterward.

EXAMPLE 5.4: McNemar’s Test

Continuing with the advertising effectiveness illustration, suppose 20 subjects were asked to express their opinions before and after viewing the advertisement. A crosstabulation of the responses is shown in Table 5.9, where it can be seen that 13 of the 20 subjects did not change their opinions of the product after viewing the advertisement. The important issue concerns how the other 7 subjects responded. That is, did those who changed their minds tend to change from favorable before to unfavorable after or from unfavorable before to favorable after (clearly the advertiser’s preference)? In the table, we see that 2 subjects changed from a favorable opinion before the advertisement to an unfavorable opinion afterward, while 5 improved their opinion after the advertisement.


Table 5.9 2 × 2 Table for Advertising Effectiveness Data

before * after Crosstabulation
Count
after
Unfavorable Favorable Total
before Unfavorable 4 5 9
Favorable 2 9 11
Total 6 14 20

Suppose for the moment that the advertiser is simply interested in knowing whether the advertisement changes the perception of the viewer in either direction. In this case, we would test the first set of hypotheses given previously. The results for this analysis are shown in Table 5.10.

Table 5.10 McNemar’s Test Results for Advertising Effectiveness Data

Chi-Square Tests
Value Exact Sig. (2-sided)
McNemar Test .453a
N of Valid Cases 20
a. Binomial distribution used.

For this analysis, the test yields a p-value of 0.453. Since this p-value is large, the null hypothesis of equal probabilities is not rejected. That is, there is not enough evidence to say that those who will change their reactions after the advertisement will do so in one direction more than the other. As mentioned earlier, it is probably the case that the advertiser wants to show that there is a stronger tendency for subjects to improve their opinion after the advertisement; that is, it is more likely to be in category (iii) than category (ii). In this case, you would want to test the second set of hypotheses given previously (i.e., the one-sided hypotheses). The data support the alternative of interest since 5 people who changed were in category (iii) and only 2 were in category (ii). Thus, for this one-sided hypothesis, the p-value in the table should be cut in half, but even with this reduction, the results are still not significant. (This negative result could have resulted from a sample size too small to detect a meaningful difference. See the discussion of sample size issues in Chapter 1.)
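The exact (binomial) McNemar p-value in Table 5.10 depends only on the two off-diagonal counts (5 and 2). As an illustration outside of SPSS, the short statsmodels sketch below (our choice of library, not the book's) reproduces it.

from statsmodels.stats.contingency_tables import mcnemar

# Table 5.9: rows = before (Unfavorable, Favorable), columns = after (Unfavorable, Favorable)
table = [[4, 5],
         [2, 9]]

result = mcnemar(table, exact=True)   # exact binomial version, as footnoted in Table 5.10
print(round(result.pvalue, 3))        # about 0.453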


Reporting the Results of McNemar’s Test

The following illustrates how you might report these McNemar test results in a publication format in the setting in which the alternative specifies an improved impression after viewing the advertisement.

Narrative for the Methods Section

“McNemar’s test was used to test the null hypothesis that the probability of changing from favorable before to unfavorable after viewing the advertisement is equal to the probability of changing from unfavorable before to favorable after viewing the advertisement.”

Narrative for the Results Section

“Using McNemar’s test, no significant tendency was found for subjects who changed their opinion to be more likely to have a favorable opinion of the product after viewing the advertisement (p = 0.23).”

SPSS Step-by-Step. EXAMPLE 5.4: McNemar’s Test

The data set MCNEMAR.SAV contains two variables, labeled Before and After, in casewise form. The data are all dichotomous, where 0 indicates nonfavorable and 1 indicates favorable. To perform a McNemar’s test on these data, open the data set MCNEMAR.SAV and follow these steps:

1. Select Analyze/Descriptive Statistics/Crosstabs.

2. Select Before as the row variable and After as the column variable.

3. Click on the Statistics button and select McNemar and Continue.

4. Click OK, and the results in Table 5.10 are shown.

Mantel-Haenszel Comparison

The Mantel-Haenszel method is often used (particularly in meta-analysis) to pool the results from several 2 × 2 contingency tables. It is also useful for the analysis of two dichotomous variables while adjusting for a third variable to determine whether there is a relationship between the two variables, controlling for levels of the third variable.


Appropriate Applications of the Mantel-Haenszel Procedure

• Disease Incidence. Case control data for a disease are collected in several cities, forming a 2 × 2 table for each city. You could use a Mantel-Haenszel analysis to obtain a pooled estimate of the odds ratio across cities.
• Pooling Results From Previous Studies. Several published studies have analyzed the same categorical variables summarized in 2 × 2 tables. In meta-analysis, the information from the studies is pooled to provide more definitive findings than could be obtained from a single study. Mantel-Haenszel analysis can be used to pool this type of information. For more on meta-analysis, see Hunt (1997) and Lipsey and Wilson (2000).

Hypothesis Tests Used in Mantel-Haenszel Analysis

The hypotheses tested in the Mantel-Haenszel test are as follows:

H0: There is no relationship between the two variables of interest when con- trolling for a third variable.

Ha: There is a relationship between the two variables of interest when control- ling for a third variable.

Design Considerations for a Mantel-Haenszel Test

A Mantel-Haenszel analysis looks at several 2 × 2 tables from the same bivariate variables, each representing some stratum or group (e.g., information from different departments at a university, etc.) or from different results of similar analyses (as in a meta-analysis). The test also assumes that the tables are independent (subjects or entities are in one and only one table).

EXAMPLE 5.5: Mantel-Haenszel Analysis

A classic data set illustrating the use of the Mantel-Haenszel test is data collected at the University of California at Berkeley concerning gender patterns in graduate admissions (Bickel & O’Connell, 1975). The crosstabulated data for acceptance (no or yes) versus gender is given in Table 5.11 for five separate departments along with row percentages, showing the percentage of each gender that was admitted within each program. From this table, it can be seen that while Department 1 seems to have higher admission rates than the other departments, the comparative acceptance rates for males and females are about the same within departments, with there being a slight tendency for females to be admitted at a higher rate.


Table 5.11 Berkeley Graduate Admissions Data

Gender * Accepted * Department Crosstabulation
Accepted
Department No Yes Total
1 Gender Female Count 8 17 25
% within Gender 32.0% 68.0% 100.0%
Male Count 207 353 560
% within Gender 37.0% 63.0% 100.0%
Total Count 215 370 585
% within Gender 36.8% 63.2% 100.0%
2 Gender Female Count 391 202 593
% within Gender 65.9% 34.1% 100.0%
Male Count 205 120 325
% within Gender 63.1% 36.9% 100.0%
Total Count 596 322 918
% within Gender 64.9% 35.1% 100.0%
3 Gender Female Count 244 131 375
% within Gender 65.1% 34.9% 100.0%
Male Count 279 138 417
% within Gender 66.9% 33.1% 100.0%
Total Count 523 269 792
% within Gender 66.0% 34.0% 100.0%
4 Gender Female Count 299 94 393
% within Gender 76.1% 23.9% 100.0%
Male Count 138 53 191
% within Gender 72.3% 27.7% 100.0%
Total Count 437 147 584
% within Gender 74.8% 25.2% 100.0%
5 Gender Female Count 317 24 341
% within Gender 93.0% 7.0% 100.0%
Male Count 351 22 373
% within Gender 94.1% 5.9% 100.0%
Total Count 668 46 714
% within Gender 93.6% 6.4% 100.0%

The Mantel-Haenszel test can be used to test the following hypotheses:

H0: Controlling for (or within departments), there is no relationship between gender and acceptance.

Ha: Controlling for (or within departments), there is a relationship between gender and acceptance.

Mantel-Haenszel results are shown in Table 5.12, where it can be seen that p = 0.756, indicating that, controlling for departments, there is no reason to conclude that there is a difference between male and female admission rates. As mentioned previously, the consistent pattern is that the admission rates for males and females are about the same for each department, with perhaps a slight tendency for females to have a higher rate of admission. Cochran’s test is similar to the Mantel-Haenszel test but is not commonly reported. In this example, Cochran’s test gives results consistent with the Mantel-Haenszel test.

Table 5.12 Mantel-Haenszel Results for Berkeley Graduate Admissions Data

Tests of Conditional Independence

Chi-Squared df Asymp. Sig. (2-sided)
Cochran’s .125 1 .724
Mantel-Haenszel .096 1 .756
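The stratified analysis can also be reproduced outside of SPSS. The sketch below uses statsmodels' StratifiedTable (our illustrative choice), passing one 2 × 2 table per department with the counts from Table 5.11. The uncorrected and continuity-corrected statistics should be close to the Cochran's and Mantel-Haenszel rows of Table 5.12, although the exact labeling of the two corrections differs between packages.

import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One 2 x 2 table per department: rows = (Female, Male), columns = (No, Yes), from Table 5.11
tables = [
    np.array([[8, 17], [207, 353]]),     # Department 1
    np.array([[391, 202], [205, 120]]),  # Department 2
    np.array([[244, 131], [279, 138]]),  # Department 3
    np.array([[299, 94], [138, 53]]),    # Department 4
    np.array([[317, 24], [351, 22]]),    # Department 5
]

st = StratifiedTable(tables)
print(st.test_null_odds(correction=False))   # chi-square and p-value without continuity correction
print(st.test_null_odds(correction=True))    # with continuity correction
print(st.oddsratio_pooled)                   # Mantel-Haenszel pooled odds ratio across departments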

Reporting Results of a Mantel-Haenszel Analysis

Narrative for the Methods Section

“Controlling for department, the relationship between gender and acceptance is examined using a Mantel-Haenszel analysis.”

Narrative for the Results Section

“Adjusting or controlling for department, no significant difference was found between male and female acceptance rates, χ2 (1, N = 3,593) = 0.10, p = 0.76.”

SPSS Step-by-Step. EXAMPLE 5.5: Mantel-Haenszel Analysis

The data in Table 5.11 are contained in the file BIAS.SAV. To perform the Mantel-Haenszel analysis on the Berkeley admissions data, open the data set BIAS.SAV and follow these steps:

1. Select Analyze/Descriptive Statistics/Crosstabs.

2. Select Gender as the row(s) variable and Accepted as the column(s) variable. Select Department as the “Layer 1 of 1” variable.

3. Click the Statistics button and check the “Cochran’s and Mantel-Haenszel Statistics” checkbox and Continue.


4. Click the Cells button and select Row in the Percentages section and Continue. Click OK, and the output includes the information in Tables 5.11 and 5.12.

5. To produce Table 5.13, leave off the Department variable in Step 2. To produce the percentages, select the Cells button and check the Row percentages option.

Tips and Caveats for Mantel-Haenszel Analysis

Simpson’s Paradox

Historically, the real interest in the Berkeley admissions data set (and the reason for Bickel and O’Connell’s [1975] article in Science) is the apparent inconsistency between the conclusions based on evidence of a possible bias against females using the combined data (see Table 5.13) and the conclusions obtained previously based on the departmental data in Table 5.11 (i.e., within departments, the acceptance rates for men and women were not significantly different). In Table 5.13, we show an overall comparison between gender and admission combined over the five departments. Interestingly, in Table 5.13, the overall admission rate for males is 37% (686/1,866), while for females, it is only 27% (468/1,727). In addition, for this 2 × 2 table, the chi-square test for independence (computer output not shown here) gives p < 0.001, indicating a relationship between gender and admission. On the surface, these data seem to indicate a sex bias against women.

Table 5.13 Berkeley Graduate Admissions Data Combined Across Departments

Gender * Accepted Crosstabulation
Accepted
No Yes Total
Gender Female Count 1259 468 1727
% within Gender 72.9% 27.1% 100.0%
Male Count 1180 686 1866
% within Gender 63.2% 36.8% 100.0%
Total Count 2439 1154 3593
% within Gender 67.9% 32.1% 100.0%

To explain the reasons for these seeming contradictions, note that admission rates into Department 1 were substantially higher than those for the other four majors. Further examination of the data in Table 5.11 indicates that males applied in greater numbers to Department 1, while females applied in greater numbers to departments into which admission was more difficult. The important point is that relationships between variables within subgroups can be entirely reversed when the data are combined across subgroups. This is called Simpson’s paradox. Clearly, analysis of these data by department, as shown in Table 5.11, provides a better picture of the relationship between gender and admission than does use of the combined data in Table 5.13.

Tests of Interrater Reliability

Interrater reliability is a measure used to examine the agreement between two people (raters/observers) on the assignment of categories of a categorical variable. It is an important measure in determining how well an implementation of some coding or measurement system works.

Appropriate Applications of Interrater Reliability

• Different people read and rate the severity (from 0 to 4) of a tumor based on a magnetic resonance imaging (MRI) scan.
• Several judges score competitors at an ice-skating competition on an integer scale of 1 to 5.
• Researchers want to compare the responses to a measure of happiness (scaled 1 to 5) experienced by husbands and wives.

A statistical measure of interrater reliability is Cohen’s kappa, which ranges from –1.0 to 1.0, where large numbers indicate better reliability, values near zero suggest that agreement is attributable to chance, and values less than zero signify that agreement is even less than that which could be attributed to chance.

EXAMPLE 5.6: Interrater Reliability Analysis

Using an example from Fleiss (2000, p. 213), suppose you have 100 subjects whose diagnosis is rated by two raters on a scale that rates each subject’s disorder as being psychological, neurological, or organic. The data are given in Table 5.14. The results of the interrater analysis are given in Table 5.15, where kappa = 0.676 with p < 0.001. This measure of agreement, while statistically significant, is only marginally convincing. Most statisticians prefer kappa


Table 5.14 Data for Interrater Reliability Analysis

Rater A
Psychological Neurological Organic
Rater B Psychological 75 1 4
Neurological 5 4 1
Organic 0 0 10

SOURCE: Statistical Methods for Rates & Proportions, Second Edition, copyright © 2000, by Joseph L. Fleiss. Reprinted with permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.

values to be at least 0.6 and most often higher than 0.7 before claiming a good level of agreement. Although not displayed in the output, you can find a 95% confidence interval using the generic formula for 95% confidence intervals:

Estimate ± 1.96 SE

Using this formula and the results in Table 5.15, an approximate 95% confidence interval on kappa is (0.504, 0.848). Some statisticians prefer the use of a weighted kappa, particularly if the categories are ordered. The weighted kappa allows “close” ratings to not simply be counted as “misses.” However, SPSS does not calculate weighted kappas.
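Kappa itself is simple to compute from the counts in Table 5.14: it compares the observed proportion of agreement with the agreement expected by chance from the marginal totals. The Python sketch below (our own illustration, not SPSS output) reproduces the value in Table 5.15 and then applies the Estimate ± 1.96 SE formula using the asymptotic standard error reported by SPSS.

import numpy as np

# Counts from Table 5.14: rows = Rater B, columns = Rater A
table = np.array([[75, 1, 4],
                  [5, 4, 1],
                  [0, 0, 10]], dtype=float)

n = table.sum()
p_observed = np.trace(table) / n                              # observed agreement: 0.89
p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2   # chance agreement from marginals: 0.66
kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 3))                                        # about 0.676, as in Table 5.15

se = 0.088                                                    # asymptotic SE from Table 5.15
print(kappa - 1.96 * se, kappa + 1.96 * se)                   # roughly (0.50, 0.85)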

Reporting the Results of an Interrater Reliability Analysis

The following illustrates how you might report this interrater analysis in a publication format.

Narrative for the Methods Section

“An interrater reliability analysis using the kappa statistic was performed to determine consistency among raters.”

Narrative for the Results Section

“The interrater reliability for the raters was found to be kappa = 0.68 (p < 0.001), 95% CI (0.504, 0.848).”


Table 5.15 Results for Interrater Reliability Analysis

Symmetric Measures
Value Asymp. Std. Errora Approx. Tb Approx. Sig.
Measure of Agreement Kappa .676 .088 8.879 .000
N of Valid Cases 100

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

SPSS Step-by-Step. EXAMPLE 5.6: Interrater Reliability Analysis

The data set KAPPA.SAV contains the data in Table 5.14 in count form. You can create this data set in a manner similar to the data set CRIME.SAV used earlier in this chapter. The data set contains the following variables: Rater_A, Rater_B, and Count. Figure 5.4 shows the data file in count (summarized) form.

Figure 5.4 Interrater Reliability Data in SPSS


To analyze these data, follow these steps:

1. Open the file KAPPA.SAV. Before performing the analysis on these summa- rized data, you must tell SPSS that the Count variable is a “weighted” vari- able. Select Data/Weight Cases . . . and select the “weight cases by” option with Count as the Frequency variable.

2. Select Analyze/Descriptive Statistics/Crosstabs.

3. Select Rater A as the row and Rater B as the column.

4. Click on the Statistics button and select Kappa and Continue.

5. Click OK to display the results for the kappa test in Table 5.15.

Goodness-of-Fit Test

A goodness-of-fit test is used to ascertain whether the distribution of observed counts in the various categories of a categorical variable matches the expected distribution of counts under a hypothetical model for the data.

Appropriate Applications of the Goodness-of-Fit Test

The following are examples of situations in which a chi-square goodness-of-fit test would be appropriate.

• Genetic theory indicates that the result of crossing two flowering plants to create a number of progeny should result in flowers that are 50% red, 25% yellow, and 25% white. When the cross-bred flowers bloom, the observed number of each color is compared with the numbers expected under the theory to see whether the genetic theory seems to apply in this case.
• A zoologist wants to know whether there is a directional preference in the positioning of nests for a particular species of birds. A sample of 300 nests is observed, and each nest is categorized as facing north, northeast, and so on. The counts for each of the eight directions are compared with the counts expected if there is no directional preference.
• The U.S. Bureau of Labor Statistics provides the percentages of U.S. full-time workers in 1996 who fell into the following five categories: less than high school, high school degree—no college, some college, bachelor’s degree, and advanced degree. A sample of 1,000 full-time workers is selected this year to determine whether the distribution of educational attainment has changed since 1996.


Design Considerations for a Goodness-of-Fit Test

1. The test assumes that a random sample of observations is taken from the population of interest.

2. The appropriate use of the chi-square distribution to approximate the distribution of the goodness-of-fit test statistic depends on both the sample size and the number of cells. A widely used rule of thumb suggested by Cochran (1954) is that the approximation is adequate if no expected cell frequencies are less than 1 and no more than 20% are less than 5. (A quick computational check of this rule is sketched below.)
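Cochran's rule is easy to verify directly once the expected cell frequencies are known. The following is a minimal Python sketch (Python is used here only for illustration, not as part of the SPSS workflow), applied to the expected frequencies from the Mendel example later in this chapter.

    import numpy as np

    # Cochran's (1954) rule of thumb: the chi-square approximation is
    # adequate if no expected cell frequency is below 1 and no more than
    # 20% of the expected frequencies are below 5.
    expected = np.array([312.75, 104.25, 104.25, 34.75])  # Mendel example (Example 5.7)

    rule_ok = expected.min() >= 1 and np.mean(expected < 5) <= 0.20
    print(rule_ok)  # True, so the approximation should be adequate here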

Hypotheses for a Goodness-of-Fit Test

The hypotheses being tested in this setting are as follows:

H0: The population (from which the sample is selected) follows the hypothesized distribution.

Ha: The population does not follow the hypothesized distribution.

To test these hypotheses, a chi-square test statistic is used that compares the observed frequencies with what is expected if the hypothesized model under the null is correct. Large values of the test statistic suggest that the alternative is true, while small values are supportive of the null. A low p-value suggests rejection of the null hypothesis and leads to the conclusion that the data do not follow the hypothesized, or theoretical, distribution.
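Concretely, if there are k categories with observed counts O_i and expected counts E_i (obtained by multiplying the hypothesized proportions by the total sample size n), the test statistic is

    \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},

which is compared with a chi-square distribution with k - 1 degrees of freedom (assuming no parameters are estimated from the data).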

Tips and Caveats for a Goodness-of-Fit Test

No One-Sided Tests

There are no one-sided/two-sided decisions to be made regarding these tests. The tests are inherently nondirectional (“sort of two-sided”) in the sense that the chi-square test is simply testing whether the observed frequencies and expected frequencies agree without regard to whether particular observed frequencies are above or below the corresponding expected frequencies. If the null hypothesis is rejected, then a good interpretation of the results will involve a discussion of the differences that were found.


EXAMPLE 5.7: Goodness-of-Fit Test

As an illustration of the goodness-of-fit test, we consider a classic experiment in genetics. Gregor Mendel, a 19th-century Moravian monk, identified some basic principles that control heredity. For example, Mendel’s theory suggests that the phenotypic frequencies resulting from a dihybrid cross of two independent genes, where each gene shows simple dominant/recessive inheritance, are expected to be in a 9:3:3:1 ratio. In Mendel’s classic experiment using the garden pea, he predicted that the frequencies of smooth yellow peas, smooth green peas, wrinkled yellow peas, and wrinkled green peas would be in a 9:3:3:1 ratio. The hypotheses being tested in this setting are the following:

H0: The population frequencies of smooth yellow, smooth green, wrinkled yellow, and wrinkled green peas will be in a 9:3:3:1 ratio.

Ha: The population frequencies will not follow this pattern.

Mendel’s experiment yielded 556 offspring, so the expected frequencies would be (9/16)556 = 312.75, (3/16)556 = 104.25, (3/16)556 = 104.25, and (1/16)556 = 34.75 for smooth yellow, smooth green, wrinkled yellow, and wrinkled green peas, respectively.

Table 5.16 Goodness-of-Fit Analysis for Mendel’s Data

Phenotype           Observed N    Expected N    Residual
Smooth Yellow       315           312.8         2.3
Smooth Green        108           104.3         3.8
Wrinkled Yellow     101           104.3         -3.3
Wrinkled Green      32            34.8          -2.8
Total               556

Test Statistics

                    Phenotype
Chi-Square(a)       .470
df                  3
Asymp. Sig.         .925

a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 34.8.


The frequencies Mendel actually observed were 315, 108, 101, and 32, respectively (i.e., quite close to those expected). In Table 5.16, we show the results for this analysis, where you can see the observed and expected frequencies. The chi-square statistic is .470 with 3 degrees of freedom and p = .925. That is, the observed phenotypic ratios followed the expected pattern quite well, and you would not reject the null hypothesis. There is no evidence to suggest that the theory is not correct. In general, if the p-value for this test is significant, it means that there is evidence that the observed data do not fit the theorized ratios.
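If you want to reproduce this result outside SPSS, the following Python sketch computes the same statistic from the observed counts and the 9:3:3:1 ratio. (Python and scipy are not part of this book's SPSS workflow; the sketch is shown only for illustration.)

    import numpy as np
    from scipy.stats import chisquare

    # Observed counts: smooth yellow, smooth green, wrinkled yellow, wrinkled green
    observed = np.array([315, 108, 101, 32])

    # Expected counts under the hypothesized 9:3:3:1 ratio
    expected = observed.sum() * np.array([9, 3, 3, 1]) / 16

    stat, p = chisquare(f_obs=observed, f_exp=expected)
    print(round(stat, 3), round(p, 3))  # 0.47 and 0.925, matching Table 5.16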

Reporting the Results of a Chi-Square Goodness-of-Fit Analysis

The following examples illustrate how you might report this goodness-of-fit test in a publication format.

Narrative for the Methods Section

“A chi-square goodness-of-fit test was performed to test the null hypothesis that the population frequencies of smooth yellow, smooth green, wrinkled yellow, and wrinkled green peas will be in a 9:3:3:1 ratio.”

Narrative for the Results Section

“The results were not statistically significant, and there is no reason to reject the claim that Mendel’s theory applies to this dihybrid cross, χ2 (3, N = 556) = 0.470, p = 0.925.”

SPSS Step-by-Step. EXAMPLE 5.7: Goodness-of-Fit Test

To perform a goodness-of-fit test in SPSS, you must have a data set consisting of counts. This data set can be set up in two ways. If you know the counts in each category, you can set up a data set consisting specifically of these counts. If you have a standard casewise data set, then you can also run a goodness-of-fit test using that file directly. The file MENDELCNT.SAV contains the actual counts Mendel obtained. To create a data set consisting of the counts and perform the goodness-of-fit test in SPSS, follow these steps:

1. Select File/New/Data. . . .

2. This new data set will have two numeric variables: Phenotype and Count. In the first column (i.e., for the Phenotype variable), enter the numbers 1 for smooth yellow, 2 for smooth green, 3 for wrinkled yellow, and 4 for wrinkled green, and in the second column, enter the corresponding counts (i.e., 315, 108, 101, and 32). Click on the Variable View tab and enter the variable names Phenotype and Count. In the Values column, specify value labels associated with the phenotype codes (i.e., 1 = smooth yellow, etc.). (See Appendix A: A Brief Tutorial for Using SPSS for Windows if you do not know how to do this.)

3. Select Data/Weight Cases . . . and select the “weight cases by” option with Count as the Frequency variable.

4. Select Analyze/Nonparametric Tests/Chi Square . . . and select Phenotype as the test variable.

5. In Expected Values, click on the Values radio button. Enter the theorized proportions in the order you entered the data. Thus, enter 9 and click Add, enter 3 and click Add, enter another 3 and click Add, and finally enter 1 and click Add. This specifies the theoretical 9:3:3:1 ratio.

6. Click OK to display the output shown in Table 5.16.

Program Comments

• In this example, notice that we entered 9, 3, 3, and 1 as the expected frequencies. You can enter any set of frequencies as long as they are multiples of the expected proportions—in this case, 9/16, 3/16, 3/16, and 1/16. So, for example, you could enter 18, 6, 6, and 2 and still obtain the same results. A natural set of expected frequencies to use is 312.75, 104.25, 104.25, and 34.75, which are the expected frequencies out of 556. However, SPSS does not allow an entry as long as 312.75. SPSS will accept 312.8 and so forth, and use of these rounded expected frequencies results in very minor differences from the results in Table 5.16.

• If your hypothesized distribution assigns equal expected counts to all categories (such as the example described earlier involving the directional preference in the positioning of nests for a certain species of bird), you would select “All categories equal” in the “Expected Values” section.

• The chi-square procedure in SPSS does not recognize string (text) variables. This is the reason we assigned Phenotype to be a numeric variable and assigned the associated value labels rather than simply specifying Phenotype as a string variable and using the four phenotypes as observed values.

Other Measures of Association for Categorical Data

Several other statistical tests can be performed on categorical data. The test performed depends on the type of categorical variables and the intent of the analysis. The following list describes several of these briefly.


Correlation. If both the rows and columns of your table contain ordered numeric values, you can produce a Spearman’s rho or other nonparametric correlations. See Chapter 7: Nonparametric Analysis Procedures for a discussion of nonparametric procedures.

Nominal Measures. A number of other specialty measures can be calculated for crosstabulation tables in which both variables are nominal. These include the following:

• Contingency Coefficient. This is a measure designed for larger tables. Some statisticians recommend that the table be at least 5 by 5. The values of the contingency coefficient range from 0 to 1, with 1 indicating high association. For smaller tables, this statistic is not recommended.

• Phi. The phi coefficient is a measure of association that is adjusted according to sample size. Specifically, it is the square root of chi-square divided by n, the sample size. Phi ranges from –1 to 1 for 2 × 2 tables and, for larger tables, from 0 to the square root of the minimum of r – 1 or c – 1, where r and c denote the number of rows and columns, respectively. Phi is most often used in a 2 × 2 table where the variables form true dichotomies. In the case of 2 × 2 tables, the phi coefficient is equal to the Pearson correlation coefficient.

• Cramer’s V. This is a measure of association based on chi-square, where the upper limit is always 1. In a 2 × 2 table, Cramer’s V is equal to the absolute value of the phi coefficient. (A computational sketch of phi and Cramer’s V follows this list.)

• Lambda. Also called the Goodman-Kruskal index, lambda is a measure of association where a high value of lambda (up to 1) indicates that the independent variable perfectly predicts the dependent variable and where a low value of lambda (down to 0) indicates that it is of no help in predicting the dependent variable.

• Uncertainty Coefficient. Sometimes called the entropy coefficient, this is a measure of association that indicates the proportional reduction in error (or uncertainty) when predicting the dependent variable. SPSS calculates symmetric and asymmetric versions of the uncertainty coefficient.
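Because phi and Cramer’s V are simple functions of the chi-square statistic, they are easy to compute directly. The following Python sketch uses a hypothetical 2 × 2 table (not data from this chapter) to illustrate; note that the chi-square-based formula yields the magnitude of phi without its sign.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical 2 x 2 table of counts (for illustration only)
    table = np.array([[30, 20],
                      [10, 40]])

    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape

    phi = np.sqrt(chi2 / n)                            # square root of chi-square / n (magnitude only)
    cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))  # upper limit is always 1
    print(round(phi, 3), round(cramers_v, 3))          # identical for a 2 x 2 table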

Ordinal Measures. Ordinal measures of association are appropriate when the two variables in the contingency table both have order.

• Gamma. This statistic ranges between –1 and 1 and is interpreted similarly to a Pearson’s correlation. For two-way tables, zero-order gammas are displayed. For three-way to n-way tables, conditional gammas are displayed.

• Somers’ d. This measure of association ranges between –1 and 1. It is an asymmetric extension of gamma. Asymmetric values are generated according to which variable is considered to be the dependent variable.


• Kendall’s tau-b. A measure of correlation for ordinal or ranked measures where ties are taken into account. It is most appropriate when the number of columns and rows is equal. Values range from –1 to 1.

• Kendall’s tau-c. This is similar to tau-b, except this measure ignores ties.

Eta. This is a measure of association that is appropriate when the dependent variable is a quantitative measure (such as age or income) and the independent variable is categorical (nominal or ordinal). Eta ranges from 0 to 1, with low values indicating less association and high values indicating a high degree of association. SPSS calculates two eta values, one that treats the row variable as the quantitative variable and one that treats the column variable as the quantitative variable.

Summary

This chapter explains how to analyze categorical data using a variety of techniques for measuring association and goodness-of-fit. The following chapter examines a comparison of three or more means.

References

Bickel, P. J., & O’Connell, J. W. (1975). Is there a sex bias in graduate admissions? Science, 187, 398–404.
Cochran, W. G. (1954). Some methods for strengthening the common chi-square test. Biometrics, 10, 417–451.
Fleiss, J. L. (2000). Statistical methods for rates & proportions (2nd ed.). New York: John Wiley.
Hunt, M. (1997). How science takes stock: The story of meta-analysis. New York: Russell Sage Foundation.
Lipsey, M. W., & Wilson, D. (2000). Practical meta-analysis. Thousand Oaks, CA: Sage.