Estimation and Testing for Association with Multiple-Response Categorical Variables from Complex Surveys

Section on Survey Research Methods Estimation and Testing for Association with Multiple-Response Categorical Variables from Complex Surveys Christopher R. Bilder1, Thomas M. Loughin2 Department of Statistics, University of Nebraska-Lincoln, [email protected], http://www.chrisbilder.com1 Department of Statistics and Actuarial Science, Simon Fraser University Surrey, Surrey, BC, V3T0A3, [email protected] KEY WORDS: Correlated binary data; Loglinear model; NHANES; Pearson statistic; Pick any/c; Rao-Scott adjustments Scherer, 1998). Only recently has work been done to 1. Introduction address these problems. For tests of independence Many surveys include questions that invite between one SRCV and one MRCV, Agresti and Liu respondents to “choose all that apply” or “pick any” from (1999), Bilder et al. (2000), Decady and Thomas (2000), a series of items. A recent Yahoo! internet search on the and Bilder and Loughin (2001) describe various phrases “survey” and “choose all that apply” yielded over adjustments to the Pearson statistic and methods to 2000 hits, many from online surveys. Issues of statistical approximate the resulting sampling distributions. Agresti validity of voluntary-response surveys notwithstanding, and Liu (1999) point out that MRCVs can be expressed this points to the fact that these questions are ubiquitous as binary vectors wherein each element of the vector in modern survey methodology. Furthermore, the United indicates whether the corresponding item is chosen as one States Office of Management and Budget has mandated of the responses. Decady and Thomas (2000) cleverly that federal surveys ask questions about race and note the parallel between an application of an adjusted ethnicity in a “choose all that apply” format (Federal Pearson statistic to MRCVs and the use of the Pearson Register, 1997, p. 58781), allowing members of the statistic in non-multinomial sampling structures as increasingly multiracial population to acknowledge their studied by Rao and Scott (1981), although the form of the complex ethnicity. Given the frequency with which these Pearson statistic used by Decady and Thomas is not questions occur, it is vital to have good methods for invariant to the 0/1 coding of the binary vectors (a statistical analysis of the data they provide. different value of the statistic results if the elements Variables that summarize survey data arising from indicate non-selection of that item). Thomas and Decady “choose all that apply” questions are referred to as (2004) and Bilder and Loughin (2004) discuss tests of multiple-response categorical variables (MRCVs). They independence between two MRCVs. present a challenge because they cannot be handled in the Beyond testing for association, there have been a few same manner as the usual single-response categorical efforts to model associations involving MRCVs. Agresti variables (SRCVs), although this has only recently been and Liu (1999, 2001) propose marginal logit models to recognized in the literature (Umesh, 1995; Loughin and describe the association between a single MRCV and a Scherer, 1998). When two or more categorical variables single SRCV. They suggest, but do not explore, are measured, questions naturally arise regarding the extensions to multiple MRCVs. Bilder and Loughin associations among them. The difficulty with the analysis (2003, 2007) examine these suggestions and propose their of associations involving MRCVs comes from the fact own modeling procedure. They conclude that a that individual subjects may respond positively to more generalized loglinear model fit using a marginal than one item from a list, and these responses are likely to estimation procedure is the preferred model due to be correlated. The result is that tests for independence computational ease, flexibility, and overall performance. between categorical variables involving MRCVs cannot Inference using this model makes use of work by Rao and be performed in the “usual” ways. For example, the Scott (1984) and Haber (1985). In particular, model Pearson test statistic for independence is not invariant to fitting uses a “pseudo” maximum likelihood approach the arbitrary choice of whether the positive or the based upon an incorrect assumption of a multinomial negative responses are tabulated (Agresti and Liu, 1999; distribution for the observed marginal counts, and then Bilder, Loughin, and Nettleton, 2000; Bilder and adjustments are applied to the sampling distributions of Loughin, 2001). Versions of the Pearson statistic that are the various estimators and test statistics. modified to be invariant to this choice of coding have All previous work on MRCVs has been conducted distributions that are not, in general, chi-square (Agresti under the assumption of simple random sampling with and Liu, 1999; Bilder et al. 2000, Bilder and Loughin, replacement, so that measurements on sampled subjects 2004). can be viewed as a set of independent, identically- Ignoring these problems and simply using the usual distributed random variables from some infinite Pearson statistic and its associated chi-square distribution population. No methods are currently available for the provides a test with very poor properties (Loughin and common situation of testing and modeling association 2838 Section on Survey Research Methods among MRCVs based on data arising from complex 2. Background survey sampling involving, for example, probability proportional to size, stratification, and/or clustering. The 2.1 General notation present research combines the results of Rao and Scott Let U denote a population of N units, and let s be a (1984) and Bilder and Loughin (2007) to extend existing sample of n units selected from U according to some modeling and analysis techniques in order to provide probability sampling plan with known first-order valid analysis of data from complex survey designs. inclusion probabilities. Let wu, u = 1, …, N, represent Beyond the measured responses, the only additional survey weights. These may be simply the inverse of the information that the proposed methods require is a set of first-order inclusion probabilities, or they may be more survey weights that permit unbiased estimation of complicated to account for nonresponse, post- population totals, and a method of calculating the stratification, and so forth. Assume that these weights are variance of these estimates. General tests for association, constructed to lead to unbiased estimates of population like those developed by Loughin and Scherer (1998), totals. Bilder et al. (2000), Thomas and Decady (2004), and To simplify the exposition, consider the case of two Bilder and Loughin (2004), are obtained as goodness-of- MRCVs, Y = (Y1, …, YI) and Z = (Z1, …, ZJ), where Yi is fit tests from the proposed models. the binary response to item i of Y and Zj is the binary The National Health and Nutrition Examination response to item j of Z. Let the corresponding observed Survey (NHANES) provides an excellent opportunity to values in the population be yu = (yu1, …, yuI) and zu = (zu1, apply the proposed methods. This survey is a large, …, zuJ ) for u = 1, …, N. Extensions to more than two nationwide study of the health and diet of people in the MRCVs are discussed in Section 5. Consider United States and is conducted periodically using a subpopulations of U corresponding to different (y, z) multistage, complex survey sampling design. The 1999- combinations with U(y, z) = {u: (yu, zu) = (y, z)}. The 2000 survey contains numerous questions that can be population total count for combination (y, z) is N(y, z) = treated as “choose all that apply” questions. The focus ∑u∈U δ (u ∈ U()y,z ), where δ(⋅) is an indicator function. here will be on questions asking the respondent about The sample-weighted estimate of the population total lifetime tobacco use (>100 cigarettes, >20 pipes, >20 count for combination (y, z) is N (,)y z = cigars, >20 snuff, and >20 chewing tobacco) and types of ∑us∈ wuuδ ( ∈ U()y,z ). These counts can be represented respiratory problems experienced (cough on most days, as 2I+J×1 vectors, N and N . Without loss of generality, bring up phlegm on most days, experienced wheezing in assume that the elements of each vector are arranged in chest, dry cough at night). The exact survey questions and lexicographic order according to the binary numerical data are available on the NHANES website at value of (y, z). Thus, element k corresponds to the www.cdc.gov/nchs/about/major/nhanes/NHANES99_00. combination (y, z) that is the binary equivalent of the htm. Table 1 displays survey-design adjusted proportions decimal value k – 1. for the number of individuals who responded positively Next, consider the population marginal count for (yi = to each item. Note that individual survey respondents a, zj = b), where a, b ∈ {0, 1}, Mab(ij) = may be represented in more than one table cell because ∑u∈U δ (,yazbui= uj = ), which is estimated by they could use more than one type of tobacco and/or have M ab() ij ==∑us∈ wyuδ (, ui azb uj =). The corresponding more than one type of respiratory problem so that population and estimated proportions are Pab(ij) = Mab(ij)/N common Pearson chi-square tests for independence or and PMab() ij= ab () ij /,N respectively, where Nw= ∑us∈ u . loglinear models accounting for the survey design can not Table 2 shows the estimated proportions for the be used. The purpose of this paper is to show how one respiratory symptoms and tobacco use data from can simultaneously model and estimate the association NHANES. Note that Mab(ij) = bNab′ () ij and M ab() ij = structure between MRCVs in this setting. I+J bNab′ () ij for a suitably-chosen 1×2 row vector bab′ () ij . This paper is organized as follows. Notation and The elements of bab() ij are δ(yi = a, zj = b) = 0 or 1 with preliminary details are given in Section 2, followed by a order corresponding to the 2I+J ordered (y, z) values. detailed description of the model and associated inference Thus, we have the representations M = BN and procedures.

Load more