Journal of Official Statistics, Vol. 26, No. 2, 2010, pp. 301–315
Tests of Multivariate Hypotheses when Using Multiple Imputation for Missing Data and Disclosure Limitation
Satkartar K. Kinney1 and Jerome P. Reiter2
Several statistical agencies use, or are considering the use of, multiple imputation to limit the risk of disclosing respondents' identities or sensitive attributes in public use data files. For example, agencies can release partially synthetic datasets, comprising the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This can be coupled with multiple imputation for missing data in a two-stage imputation approach. First the agency fills in the missing data to generate m completed datasets, then replaces sensitive or identifying values in each completed dataset with n imputed values. Methods for obtaining inferences with the mn datasets have been developed for scalar quantities, but not for multivariate quantities. We present methods for testing multivariate null hypotheses with such datasets. We illustrate the tests using public use files for the Survey of Income and Program Participation that were created with the two-stage imputation approach.
Key words: Confidentiality; disclosure; multiple imputation; significance tests; synthetic data.
1. Introduction

Statistical agencies and other organizations that disseminate data to the public are ethically, practically, and often legally required to protect the confidentiality of respondents' identities and sensitive attributes. To satisfy these requirements, Rubin (1993) and Little (1993) proposed that agencies utilize multiple imputation approaches. For example, agencies can release the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. These are called partially synthetic datasets (Reiter 2003). In recent years, statistical agencies have begun to use partially synthetic approaches to create public use data for major surveys. For example, in 2007 the U.S. Census Bureau released a partially synthetic, public use file for the Survey of Income and Program Participation (SIPP) that includes imputed values of Social Security benefits information and dozens of other highly sensitive variables (www.sipp.census.gov/sipp/synth data.html). The U.S. Census Bureau also plans to protect the identities of people in group quarters (e.g., prisons, shelters) in the next release of public use files from the American Communities Survey by replacing demographic data for people at high disclosure risk with imputations. Partially synthetic, public use datasets are in the development stage for the U.S. Census Bureau's Longitudinal Business Database, Longitudinal Employer-Household Dynamics survey, and American Communities Survey veterans and full sample data. Statistical agencies in Canada, Germany (Drechsler et al. 2007), and New Zealand (Graham and Penny 2005) also are investigating the approach. Other applications of partially synthetic data are described by Kennickell (1997), Abowd and Woodcock (2001, 2004), Abowd and Lane (2004), Little et al. (2004), Reiter (2005b), Mitra and Reiter (2006), Reiter and Mitra (2008), An and Little (2007), and Reiter and Raghunathan (2007).

In addition to protecting confidentiality, statistical agencies releasing public use data nearly always have to deal with survey nonresponse. Agencies conveniently can handle the missing data and confidentiality protection simultaneously with a two-stage multiple imputation approach (Reiter 2004). First, the agency uses multiple imputation to fill in the missing data, generating m completed datasets. Second, the agency replaces the values at risk of disclosure in each imputed dataset with n multiple imputations, ultimately releasing mn datasets. The nesting of imputations enables analysts to simply and properly account for the different sources of variability arising from the two types of imputations, which is not straightforward without nesting (Reiter 2004).

1 National Institute of Statistical Sciences, PO Box 14006, Research Triangle Park, NC 27709, U.S.A. Email: [email protected]
2 Department of Statistical Science, Duke University, Box 90251, Durham, NC 27708, U.S.A. Email: [email protected]
Acknowledgments: This research was supported by the U.S. National Science Foundation grant ITR-0427889.
© Statistics Sweden
This two-stage approach was used to create synthetic public use files for the Survey of Income and Program Participation and is being considered for other partially synthetic products at the U.S. Census Bureau. Methods for obtaining inferences with such two-stage synthetic datasets have been developed for scalar quantities but not for multivariate quantities. Given the importance of the SIPP, which is the largest and most widely used data source on people on public assistance in the U.S., and the growing interest in using multiple imputation for releasing confidential data, there is a clear need for methodology for obtaining multivariate inferences with such datasets. This article helps address this need by presenting methods for large sample tests of multivariate null hypotheses when multiple imputation is used simultaneously for missing and partially synthetic data.

The remainder of the article is organized as follows. In Section 2, we review the two-stage procedure of Reiter (2004) and extend its distributional theory to multivariate estimands. In Section 3, we use this theory to derive a Wald-like test for multivariate null hypotheses and illustrate the test on the partially synthetic, public use data for the SIPP. In Section 4, we describe an asymptotically equivalent test based on likelihood ratio statistics. Finally, in Section 5, we provide some concluding remarks.
2. Multiple Imputation for Missing Data and Disclosure Limitation

For a finite population of size $N$, let $I_l = 1$ if unit $l$ is included in the survey, and $I_l = 0$ otherwise, where $l = 1, \ldots, N$. Let $I = (I_1, \ldots, I_N)$, and let the sample size $s = \sum I_l$. Let $X$ be the $N \times d$ matrix of sampling design variables, e.g., stratum or cluster indicators or size measures. We assume that $X$ is known approximately for the entire population, for example from census records or the sampling frame(s). Let $Y$ be the $N \times p$ matrix of survey data for the population. Let $Y_{inc} = (Y_{obs}, Y_{mis})$ be the $s \times p$ submatrix of $Y$ for all units with $I_l = 1$, where $Y_{obs}$ is the portion of $Y_{inc}$ that is observed and $Y_{mis}$ is the portion of $Y_{inc}$ that is missing due to nonresponse. Let $R$ be an $N \times p$ matrix of indicators such that
$R_{lk} = 1$ if the response for unit $l$ to item $k$ is recorded, and $R_{lk} = 0$ otherwise. The observed data is thus $D_{obs} = (X, Y_{obs}, I, R)$.

To generate the synthetic data, the agency first fills in values for $Y_{mis}$ with draws from the conditional distribution of $(Y_{mis} \mid D_{obs})$, or approximations of that distribution such as those of Raghunathan et al. (2001). These draws are repeated independently $i = 1, \ldots, m$ times to obtain $m$ completed datasets, $D_{com} = \{D_{com}^{(i)} = (D_{obs}, Y_{mis}^{(i)}), \; i = 1, \ldots, m\}$. Having dealt with the missing data, the agency limits disclosure risks by replacing selected values in each $D_{com}^{(i)}$ with multiple imputations. For each $D_{com}^{(i)}$, imputations are made independently $j = 1, \ldots, n$ times to yield $n$ different partially synthetic datasets.
Let $Z_l = 1$ if unit $l$ is selected to have any of its data replaced with synthetic values, and let $Z_l = 0$ for those units with all data left unchanged. Let $Z = (Z_1, \ldots, Z_s)$. Let $Y_{rep}^{(i,j)}$ be all the imputed (replaced) values in the $j$th synthetic dataset associated with $D_{com}^{(i)}$, and let $Y_{nrep}^{(i)}$ be all unchanged (not replaced) values of $D_{com}^{(i)}$. The $Y_{rep}^{(i,j)}$ are generated from the conditional distribution of $(Y_{rep}^{(i,j)} \mid D_{com}^{(i)}, Z)$, or a close approximation of it. Each synthetic dataset, $D_{syn}^{(i,j)}$, then comprises $(X, Y_{rep}^{(i,j)}, Y_{nrep}^{(i)}, I, R, Z)$. The entire collection of $M = mn$ datasets, $D_{syn} = \{D_{syn}^{(i,j)}, \; i = 1, \ldots, m; \; j = 1, \ldots, n\}$, with labels indicating the nests, is released to the public.

We now extend the distributional theory in Reiter (2004) to multivariate quantities. Let $Q$ be a $k \times 1$ vector-valued estimand, such as a vector of regression coefficients. Let $Q^{(i,j)}$ be the estimate of $Q$ computed with $D_{syn}^{(i,j)}$, and let $U^{(i,j)}$ be the estimate of the $k \times k$ covariance matrix of $Q^{(i,j)}$. The following quantities are needed for inferences.
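The nested generation process above can be sketched in code. This is a toy illustration only, not the SIPP procedure: it uses a single sensitive variable synthesized for all units, simple regression imputation, and (for brevity) omits the draw of imputation-model parameters from their posterior that a proper multiple imputation would include. All variable names and model choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 200
y0 = rng.normal(size=s)               # sensitive variable, fully observed
y1 = y0 + rng.normal(size=s)          # survey variable subject to nonresponse
miss = rng.random(s) < 0.3            # MCAR missingness indicator for y1

m, n = 4, 2
released = {}                         # keys (i, j): the mn nested datasets
for i in range(m):
    # Stage 1: complete the missing y1 given observed data (toy: draws from
    # a fitted regression of y1 on y0; a full MI would also draw the
    # regression parameters from their posterior)
    b = np.polyfit(y0[~miss], y1[~miss], 1)
    sd = np.std(y1[~miss] - np.polyval(b, y0[~miss]))
    y1_com = y1.copy()
    y1_com[miss] = np.polyval(b, y0[miss]) + rng.normal(0, sd, miss.sum())
    for j in range(n):
        # Stage 2: replace all values of the sensitive y0 with synthetic
        # draws conditional on the completed dataset
        a = np.polyfit(y1_com, y0, 1)
        sd0 = np.std(y0 - np.polyval(a, y1_com))
        y0_syn = np.polyval(a, y1_com) + rng.normal(0, sd0, s)
        released[(i, j)] = np.column_stack([y0_syn, y1_com])

print(len(released))                  # mn = 8 datasets, labeled by nest
```

The key structural point is that all $n$ synthetic datasets within nest $i$ share the same stage-one completion of the missing data, which is what the nest labels convey to the analyst.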
$$\bar{Q} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} Q^{(i,j)} = \frac{1}{m}\sum_{i=1}^{m}\bar{Q}^{(i)}$$

$$\bar{U} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} U^{(i,j)}$$

$$\bar{W} = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{n-1}\sum_{j=1}^{n}\left(Q^{(i,j)} - \bar{Q}^{(i)}\right)\left(Q^{(i,j)} - \bar{Q}^{(i)}\right)' = \frac{1}{m}\sum_{i=1}^{m} W^{(i)}$$

$$B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\bar{Q}^{(i)} - \bar{Q}\right)\left(\bar{Q}^{(i)} - \bar{Q}\right)'$$

We also define $B_\infty = \lim B$ as $m \to \infty$ and $n \to \infty$, $W_\infty^{(i)} = \lim W^{(i)}$ as $n \to \infty$, and $\bar{W}_\infty = \sum_{i=1}^{m} W_\infty^{(i)}/m$. All posterior distributions presented in this and subsequent sections are based on diffuse prior distributions.

To begin, we utilize the assumptions described by Rubin (1987, Chapter 3) for multiple imputation for missing data. Let $\bar{Q}_{com}^{(i)}$ be the estimate of $Q$ that would be obtained from $D_{com}^{(i)}$ prior to replacement of confidential values, and let $\bar{Q}_{com} = \sum_{i=1}^{m} \bar{Q}_{com}^{(i)}/m$. We then can write Rubin's (1987) well-known result as

$$\left(Q \mid D_{com}, B_\infty, \bar{W}_\infty\right) \sim N\left(\bar{Q}_{com},\; \bar{U} + (1 + 1/m)B_\infty\right) \qquad (1)$$

An implicit assumption here is that each $U^{(i,j)}$ has sufficiently low variability so that $U^{(i,j)} \approx U_{com}^{(i)}$, where $U_{com}^{(i)}$ is the variance estimate of $\bar{Q}_{com}^{(i)}$ computed from $D_{com}^{(i)}$. Similarly, we assume that the $U_{com}^{(i)} \approx U$, and hence $\bar{U} \approx U$, where $U$ is the variance that
would be obtained from $D_{inc} = (X, Y_{inc}, I)$, i.e., if all the data were observed. These are typical assumptions in multiple imputation, motivated by the fact that posterior variances generally have lower order variability than posterior means (Rubin 1987, p. 89).

Following Reiter (2004), we assume the sampling distributions $Q^{(i,j)} \sim N\left(\bar{Q}_{com}^{(i)}, W_\infty^{(i)}\right)$ for all $(i, j)$, so that

$$\left(\bar{Q}_{com}^{(i)} \mid D_{syn}, B_\infty, W_\infty^{(i)}\right) \sim N\left(\bar{Q}^{(i)}, W_\infty^{(i)}/n\right) \qquad (2)$$

Integrating (1) and (2) with respect to the collection of $\bar{Q}_{com}^{(i)}$, we have

$$\left(Q \mid D_{syn}, B_\infty, \bar{W}_\infty\right) \sim N\left(\bar{Q}, T_\infty\right) \qquad (3)$$

where $T_\infty = \bar{U} + (1 + 1/m)B_\infty + \bar{W}_\infty/(mn)$. We note that the fractional increase in the variance of $Q$ due to missing data is $(1 + 1/m)B_\infty \bar{U}^{-1}$ and due to replacement data is $\{\bar{W}_\infty/(mn)\}\bar{U}^{-1}$.

In practice, $B_\infty$ and $\bar{W}_\infty$ are not known and must be integrated out of (3). To do so, we utilize the sampling distributions for $\bar{Q}_{com}^{(i)}$ from Rubin (1987), $\bar{Q}_{com}^{(i)} \sim N(Q_{obs}, B_\infty)$ for all $i$. Here, $Q_{obs}$ is the estimate of $Q$ that would be obtained from $D_{obs}$. Combining these distributions with the sampling distribution underlying (2), we have

$$\left(\bar{Q}^{(i)} \mid D_{obs}, B_\infty, W_\infty^{(i)}\right) \sim N\left(Q_{obs},\; B_\infty + W_\infty^{(i)}/n\right) \qquad (4)$$

Using the sampling distribution of $Q^{(i,j)}$, we have

$$\left\{W^{(i)}\left(W_\infty^{(i)}\right)^{-1} \mid D_{syn}\right\} \sim \mathrm{Wi}(n-1, I) \qquad (5)$$

where $\mathrm{Wi}(n-1, I)$ is a Wishart distribution with $(n-1)$ degrees of freedom and scale matrix $I$. Finally, from (4) and (5) and the simplifying assumption that $W_\infty^{(i)} = \bar{W}_\infty$ for all $i$, we have

$$\left\{B\left(B_\infty + \bar{W}_\infty/n\right)^{-1} \mid D_{syn}, \bar{W}_\infty\right\} \sim \mathrm{Wi}(m-1, I) \qquad (6)$$

$$\left\{\bar{W}\left(\bar{W}_\infty\right)^{-1} \mid D_{syn}\right\} \sim \mathrm{Wi}(m(n-1), I) \qquad (7)$$

For sufficiently large $s$, $m$, and $n$, we can replace $B_\infty$ and each $W_\infty^{(i)}$ with their approximate expected values, resulting in the variance estimate $T = (1 + 1/m)B - (1/n)\bar{W} + \bar{U}$. Analysts can base inferences for $Q$ on the distribution

$$\left(Q - \bar{Q}\right) \sim N(0, T) \qquad (8)$$
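The point estimate $\bar{Q}$ and the variance estimate $T$ follow directly from the displayed formulas. A minimal sketch of the combining rules, assuming the analyst has stacked the per-dataset estimates $Q^{(i,j)}$ and $U^{(i,j)}$ into arrays (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def combine_two_stage(Q, U):
    """Combining rules for m*n nested synthetic datasets.

    Q: shape (m, n, k), the estimates Q^(i,j)
    U: shape (m, n, k, k), the covariance estimates U^(i,j)
    Returns (Q_bar, T), the point estimate and variance estimate of (8).
    """
    m, n, k = Q.shape
    Q_i = Q.mean(axis=1)                 # nest means, \bar{Q}^{(i)}
    Q_bar = Q_i.mean(axis=0)             # overall mean, \bar{Q}
    U_bar = U.mean(axis=(0, 1))          # \bar{U}
    # within-nest covariance \bar{W}: average over i of W^{(i)}
    dev = Q - Q_i[:, None, :]
    W_bar = np.einsum('ijk,ijl->kl', dev, dev) / (m * (n - 1))
    # between-nest covariance B
    d = Q_i - Q_bar
    B = np.einsum('ik,il->kl', d, d) / (m - 1)
    T = (1 + 1/m) * B - W_bar / n + U_bar
    return Q_bar, T
```

Note the minus sign on $\bar{W}/n$: as Section 3 discusses, this term can make $T$ indefinite in small samples of datasets.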
The fractional increase in variance due to missing data is estimated from $D_{syn}$ to be $(1 + 1/m)(B - \bar{W}/n)\bar{U}^{-1}$. The estimated fractional increase due to replacement data is $\{\bar{W}/(mn)\}\bar{U}^{-1}$. For each of these, the average fractional increase across components of $Q$ equals the average of the diagonal elements of these matrices.
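These two diagnostics are simple matrix computations once $B$, $\bar{W}$, and $\bar{U}$ are in hand. A short sketch (the function name is an assumption; it returns the averages of the diagonal elements, as described above):

```python
import numpy as np

def fractional_increases(B, W_bar, U_bar, m, n):
    """Average fractional variance increases across the components of Q,
    due to missing data and due to replacement (synthesis)."""
    U_inv = np.linalg.inv(U_bar)
    r_mis = np.diag((1 + 1/m) * (B - W_bar / n) @ U_inv).mean()
    r_rep = np.diag((W_bar / (m * n)) @ U_inv).mean()
    return r_mis, r_rep
```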
3. Wald-type Tests
Using the $M$ released datasets, an analyst seeks to test the null hypothesis $Q = Q_0$, for example to test if $k$ regression coefficients equal zero. In this section, we first argue and demonstrate that the natural test based on the Wald test statistic for (8) can be poorly calibrated. We then derive an alternative test that tends to have better properties.
3.1. Poor Properties of Test Based on Wald Test Statistic
Given the normal approximation for inferences about $Q$ in (8), it may appear reasonable to use the test statistic $D = (Q_0 - \bar{Q})' T^{-1} (Q_0 - \bar{Q})$, and the $p$-value equal to $pr(\chi_k^2 > D)$, where $\chi_k^2$ is a chi-squared random variable with $k$ degrees of freedom. However, this test is unreliable when $k$ is large and $m$ and $n$ are moderate, as is frequently the case, because $B$ or $\bar{W}$ can have large variability. Estimating $B$ or $\bar{W}$ in such cases is akin to estimating a covariance matrix using few observations compared to the number of dimensions. We can illustrate the poor properties of tests based on $D$ with simulation studies.
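The naive test statistic and its chi-squared reference $p$-value can be computed as follows (a straightforward sketch of the statistic just defined; the function name is ours):

```python
import numpy as np
from scipy import stats

def wald_p_value(Q_bar, T, Q0):
    """Naive Wald test of H0: Q = Q0 based on (8).

    Returns (D, p) where D = (Q0 - Q_bar)' T^{-1} (Q0 - Q_bar) and
    p = pr(chi2_k > D). As discussed in the text, this reference
    distribution is poorly calibrated when k is large relative to m and n.
    """
    d = Q0 - Q_bar
    D = d @ np.linalg.solve(T, d)        # quadratic form without forming T^{-1}
    return D, stats.chi2.sf(D, df=len(Q_bar))
```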
For sample size $s = 1{,}000$, we simulate the complete data, $\{Y_0, Y_1, \ldots, Y_{20}\}$, from independent normal distributions with $E(Y_i) = 0$ for all $i$, $\mathrm{var}(Y_0) = 1$, and $\mathrm{var}(Y_i) = 2$ for $i > 0$. To simulate missing data, for computational simplicity we make 30% of the observations have $\{Y_1, \ldots, Y_{20}\}$ missing completely at random and $Y_0$ always fully observed. We obtain the set of completed datasets, $D_{com}$, by drawing values of the missing data from $f(Y_1, \ldots, Y_{20} \mid D_{obs})$, using a multivariate normal distribution with an unrestricted covariance matrix. To simulate partial synthesis, we replace all values of $Y_0$. The replacement imputations for each $D_{syn}^{(i,j)}$ are drawn independently from $f(Y_0 \mid D_{com}^{(i)})$. We vary the number of imputations according to $m \in (4, 8)$ and $n \in (2, 4, 8)$, which are in the range of values likely to be used by agencies releasing data. For example, the SIPP synthetic data use ($m = 4$, $n = 4$).

We test the null hypothesis $Q = 0$, where $Q$ is the vector of coefficients for the regression of $Y_0$ on $Y_1, \ldots, Y_k$, excluding the intercept, for $k \in (5, 10, 20)$. Table 1 summarizes the simulated rejection rates of the test based on $pr(\chi_k^2 > D)$. Results are based on 10,000 runs of the simulation for each combination of $m$, $n$, and $k$, for significance levels $\alpha \in (1\%, 5\%, 10\%)$. The rejection rates for the test far exceed the nominal $\alpha$ levels, often by so much as to make the test essentially useless. This problem is alleviated only by making $m$ and $n$ excessively large. In the simulation scenario just described, setting $m$ and $n$ to 50 yielded rejection rates much closer to the desired levels; however, doing so is impractical: in addition to the computational and analytic burden, releasing so many datasets can result in increased risk of disclosure.
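The mechanism behind the miscalibration can be seen without running the full imputation simulation. The sketch below abstracts it: under the null, $\bar{Q} - Q_0 \sim N(0, T_\infty)$, while $B$ and $\bar{W}$ are drawn from Wishart distributions with the expectations implied by (6) and (7), namely $E(B) = B_\infty + \bar{W}_\infty/n$ and $E(\bar{W}) = \bar{W}_\infty$ on $m-1$ and $m(n-1)$ degrees of freedom. The variance components $u$, $b_\infty$, $w_\infty$ are illustrative values we chose, not the paper's; the point is only that with $m - 1 = 3$ degrees of freedom in $k = 20$ dimensions, $T$ is very noisy and the rejection rate far exceeds the nominal level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, m, n = 20, 4, 2
u, b_inf, w_inf = 1.0, 0.5, 0.5          # assumed true variance components
T_inf = u + (1 + 1/m) * b_inf + w_inf / (m * n)
crit = stats.chi2.ppf(0.95, k)           # nominal 5%-level critical value

reps, rej = 2000, 0
for _ in range(reps):
    z = rng.normal(0, np.sqrt(T_inf), k)           # Q_bar - Q0 under H0
    # B: Wishart with m-1 = 3 df in k = 20 dimensions -- rank deficient
    G = rng.normal(0, np.sqrt(b_inf + w_inf / n), (m - 1, k))
    B = G.T @ G / (m - 1)
    # W_bar: Wishart with m(n-1) df
    H = rng.normal(0, np.sqrt(w_inf), (m * (n - 1), k))
    W_bar = H.T @ H / (m * (n - 1))
    T = (1 + 1/m) * B - W_bar / n + u * np.eye(k)
    D = z @ np.linalg.solve(T, z)
    rej += D > crit
print(rej / reps)                        # well above the nominal 0.05
```

Although the draws here are only an approximation of what the full two-stage simulation produces, the inflated rejection rate reproduces the qualitative behavior reported in Table 1.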
Another problem with the test statistic $D$ is that the variance estimate $T$ can have negative diagonal elements, which can result in negative values of $D$. This is most likely to occur for large values of $k$ when $n$ is small. This occurred in a simulation test about 20% of the time