Journal of Official Statistics, Vol. 26, No. 2, 2010, pp. 301–315

Tests of Multivariate Hypotheses when Using Multiple Imputation for Missing Data and Disclosure Limitation

Satkartar K. Kinney1 and Jerome P. Reiter2

Several statistical agencies use, or are considering the use of, multiple imputation to limit the risk of disclosing respondents' identities or sensitive attributes in public use data files. For example, agencies can release partially synthetic datasets, comprising the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This can be coupled with multiple imputation for missing data in a two-stage imputation approach. First the agency fills in the missing data to generate m completed datasets, then replaces sensitive or identifying values in each completed dataset with n imputed values. Methods for obtaining inferences with the mn datasets have been developed for scalar quantities, but not for multivariate quantities. We present methods for testing multivariate null hypotheses with such datasets. We illustrate the tests using public use files for the Survey of Income and Program Participation that were created with the two-stage imputation approach.

Key words: Confidentiality; disclosure; multiple imputation; significance tests; synthetic data.

1. Introduction

Statistical agencies and other organizations that disseminate data to the public are ethically, practically, and often legally required to protect the confidentiality of respondents' identities and sensitive attributes. To satisfy these requirements, Rubin (1993) and Little (1993) proposed that agencies utilize multiple imputation approaches. For example, agencies can release the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. These are called partially synthetic datasets (Reiter 2003). In recent years, statistical agencies have begun to use partially synthetic approaches to create public use data for major surveys. For example, in 2007 the U.S. Census Bureau released a partially synthetic, public use file for the Survey of Income and Program Participation (SIPP) that includes imputed values of Social Security benefits information and dozens of other highly sensitive variables (www.sipp.census.gov/sipp/synthdata.html). The U.S. Census Bureau also plans to protect the identities of people in group quarters (e.g., prisons, shelters) in the next release of public use files from the American Communities Survey by replacing demographic data for people at high disclosure risk with imputations. Partially synthetic, public use datasets are in the development stage for the U.S. Census Bureau's Longitudinal Business Database, Longitudinal Employer-Household Dynamics survey, and American Communities Survey veterans and full sample data. Statistical agencies in Canada, Germany (Drechsler et al. 2007), and New Zealand (Graham and Penny 2005) also are investigating the approach. Other applications of partially synthetic data are described by Kennickell (1997), Abowd and Woodcock (2001, 2004), Abowd and Lane (2004), Little et al. (2004), Reiter (2005b), Mitra and Reiter (2006), Reiter and Mitra (2008), An and Little (2007), and Reiter and Raghunathan (2007).

In addition to protecting confidentiality, statistical agencies releasing public use data nearly always have to deal with survey nonresponse. Agencies conveniently can handle the missing data and confidentiality protection simultaneously with a two-stage multiple imputation approach (Reiter 2004). First, the agency uses multiple imputation to fill in the missing data, generating m completed datasets. Second, the agency replaces the values at risk of disclosure in each imputed dataset with n multiple imputations, ultimately releasing mn datasets. The nesting of imputations enables analysts to simply and properly account for the different sources of variability arising from the two types of imputations, which is not straightforward without nesting (Reiter 2004).

1 National Institute of Statistical Sciences, PO Box 14006, Research Triangle Park, NC 27709, U.S.A. Email: [email protected]
2 Department of Statistical Science, Duke University, Box 90251, Durham, NC 27708, U.S.A. Email: [email protected]
Acknowledgments: This research was supported by the U.S. National Science Foundation grant ITR-0427889.
© Statistics Sweden
This two-stage approach was used to create synthetic public use files for the Survey of Income and Program Participation and is being considered for other partially synthetic products at the U.S. Census Bureau. Methods for obtaining inferences with such two-stage synthetic datasets have been developed for scalar quantities but not for multivariate quantities. Given the importance of the SIPP (it is the largest and most widely used data source on people on public assistance in the U.S.) and the growing interest in using multiple imputation for releasing confidential data, there is a clear need for methodology for obtaining multivariate inferences with such datasets. This article helps address this need by presenting methods for large sample tests of multivariate null hypotheses when multiple imputation is used simultaneously for missing and partially synthetic data. The remainder of the article is organized as follows. In Section 2, we review the two-stage procedure of Reiter (2004) and extend its distributional theory to multivariate estimands. In Section 3, we use this theory to derive a Wald-like test for multivariate null hypotheses and illustrate the test on the partially synthetic, public use data for the SIPP. In Section 4, we describe an asymptotically equivalent test based on likelihood ratio statistics. Finally, in Section 5, we provide some concluding remarks.

2. Multiple Imputation for Missing Data and Disclosure Limitation

For a finite population of size $N$, let $I_l = 1$ if unit $l$ is included in the survey, and $I_l = 0$ otherwise, where $l = 1, \ldots, N$. Let $I = (I_1, \ldots, I_N)$, and let the sample size $s = \sum_{l=1}^{N} I_l$. Let $X$ be the $N \times d$ matrix of design variables, e.g., stratum or cluster indicators or size measures. We assume that $X$ is known approximately for the entire population, for example from census records or the sampling frame(s). Let $Y$ be the $N \times p$ matrix of survey data for the population. Let $Y_{inc} = (Y_{obs}, Y_{mis})$ be the $s \times p$ submatrix of $Y$ for all units with $I_l = 1$, where $Y_{obs}$ is the portion of $Y_{inc}$ that is observed and $Y_{mis}$ is the portion of $Y_{inc}$ that is missing due to nonresponse. Let $R$ be an $N \times p$ matrix of indicators such that

$R_{lk} = 1$ if the response for unit $l$ to item $k$ is recorded, and $R_{lk} = 0$ otherwise. The observed data are thus $D_{obs} = (X, Y_{obs}, I, R)$. To generate the synthetic data, the agency first fills in values for $Y_{mis}$ with draws from the conditional distribution of $(Y_{mis} \mid D_{obs})$, or approximations of that distribution such as those of Raghunathan et al. (2001). These draws are repeated independently $i = 1, \ldots, m$ times to obtain $m$ completed datasets, $D_{com} = \{D_{com}^{(i)} = (D_{obs}, Y_{mis}^{(i)}),\; i = 1, \ldots, m\}$. Having dealt with the missing data, the agency limits disclosure risks by replacing selected values in each $D_{com}^{(i)}$ with multiple imputations. For each $D_{com}^{(i)}$, imputations are made independently $j = 1, \ldots, n$ times to yield $n$ different partially synthetic datasets.

Let $Z_l = 1$ if unit $l$ is selected to have any of its data replaced with synthetic values, and let $Z_l = 0$ for those units with all data left unchanged. Let $Z = (Z_1, \ldots, Z_s)$. Let $Y_{rep}^{(i,j)}$ be all the imputed (replaced) values in the $j$th synthetic dataset associated with $D_{com}^{(i)}$, and let $Y_{nrep}^{(i)}$ be all unchanged (not replaced) values of $D_{com}^{(i)}$. The $Y_{rep}^{(i,j)}$ are generated from the conditional distribution of $(Y_{rep}^{(i,j)} \mid D_{com}^{(i)}, Z)$, or a close approximation of it. Each synthetic dataset, $D_{syn}^{(i,j)}$, then comprises $(X, Y_{rep}^{(i,j)}, Y_{nrep}^{(i)}, I, R, Z)$. The entire collection of $M = mn$ datasets, $D_{syn} = \{D_{syn}^{(i,j)},\; i = 1, \ldots, m,\; j = 1, \ldots, n\}$, with labels indicating the nests, is released to the public.

We now extend the distributional theory in Reiter (2004) to multivariate quantities. Let $Q$ be a $k \times 1$ vector-valued estimand, such as a vector of regression coefficients. Let $Q^{(i,j)}$ be the estimate of $Q$ computed with $D_{syn}^{(i,j)}$, and let $U^{(i,j)}$ be the estimate of the $k \times k$ variance matrix of $Q^{(i,j)}$. The following quantities are needed for inferences.

$$\bar{Q} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} Q^{(i,j)} = \frac{1}{m} \sum_{i=1}^{m} \bar{Q}^{(i)}$$

$$\bar{U} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} U^{(i,j)}$$

$$\bar{W} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n-1} \sum_{j=1}^{n} (Q^{(i,j)} - \bar{Q}^{(i)})(Q^{(i,j)} - \bar{Q}^{(i)})' = \frac{1}{m} \sum_{i=1}^{m} W^{(i)}$$

$$B = \frac{1}{m-1} \sum_{i=1}^{m} (\bar{Q}^{(i)} - \bar{Q})(\bar{Q}^{(i)} - \bar{Q})'$$

We also define $B_\infty = \lim B$ as $m \to \infty$ and $n \to \infty$; $W_\infty^{(i)} = \lim W^{(i)}$ as $n \to \infty$; and $\bar{W}_\infty = \sum_{i=1}^{m} W_\infty^{(i)}/m$. All posterior distributions presented in this and subsequent sections are based on diffuse prior distributions.

To begin, we utilize the assumptions described by Rubin (1987, Chapter 3) for multiple imputation for missing data. Let $Q_{com}^{(i)}$ be the estimate of $Q$ that would be obtained from $D_{com}^{(i)}$ prior to replacement of confidential values, and let $\bar{Q}_{com} = \sum_{i=1}^{m} Q_{com}^{(i)}/m$. We then can write Rubin's (1987) well-known result as
$$(Q \mid D_{com}, B_\infty, \bar{W}_\infty) \sim N(\bar{Q}_{com},\; U + (1 + 1/m) B_\infty) \qquad (1)$$
An implicit assumption here is that each $U^{(i,j)}$ has sufficiently low variability so that $U^{(i,j)} \approx U_{com}^{(i)}$, where $U_{com}^{(i)}$ is the variance estimate of $Q_{com}^{(i)}$ computed from $D_{com}^{(i)}$. Similarly, we assume that the $U_{com}^{(i)} \approx U$, and hence $\bar{U} \approx U$, where $U$ is the variance that would be obtained from $D_{inc} = (X, Y_{inc}, I)$, i.e., if all the data were observed. These are typical assumptions in multiple imputation, motivated by the fact that posterior variances generally have variability of lower order than posterior means (Rubin 1987, p. 89).

Following Reiter (2004), we assume the sampling distributions $Q^{(i,j)} \sim N(Q_{com}^{(i)}, W_\infty^{(i)})$ for all $(i, j)$, so that
$$(Q_{com}^{(i)} \mid D_{syn}, B_\infty, \bar{W}_\infty) \sim N(\bar{Q}^{(i)},\; W_\infty^{(i)}/n) \qquad (2)$$
Integrating (1) and (2) with respect to the collection of $Q_{com}^{(i)}$, we have
$$(Q \mid D_{syn}, B_\infty, \bar{W}_\infty) \sim N(\bar{Q}, T_\infty) \qquad (3)$$
where $T_\infty = U + (1 + 1/m) B_\infty + \bar{W}_\infty/(mn)$. We note that the fractional increase in the variance of $Q$ due to missing data is $(1 + 1/m) B_\infty U^{-1}$, and due to replacement data is $\{\bar{W}_\infty/(mn)\} U^{-1}$.

In practice, $B_\infty$ and $\bar{W}_\infty$ are not known and must be integrated out of (3). To do so, we utilize the sampling distributions for $Q_{com}^{(i)}$ from Rubin (1987), $Q_{com}^{(i)} \sim N(Q_{obs}, B_\infty)$ for all $i$. Here, $Q_{obs}$ is the estimate of $Q$ that would be obtained from $D_{obs}$. Combining these distributions with those underlying (2), we have
$$(\bar{Q}^{(i)} \mid D_{obs}, B_\infty, \bar{W}_\infty) \sim N(Q_{obs},\; B_\infty + W_\infty^{(i)}/n) \qquad (4)$$
Using the sampling distribution of $Q^{(i,j)}$, we have
$$\{(n - 1)\, W^{(i)} (W_\infty^{(i)})^{-1} \mid D_{syn}\} \sim \text{Wi}(n - 1, I) \qquad (5)$$
where $\text{Wi}(n - 1, I)$ is a Wishart distribution with $(n - 1)$ degrees of freedom and scale matrix $I$. Finally, from (4) and (5) and the simplifying assumption that $W_\infty^{(i)} = \bar{W}_\infty$ for all $i$, we have
$$\{(m - 1)\, B (B_\infty + \bar{W}_\infty/n)^{-1} \mid D_{syn}, \bar{W}_\infty\} \sim \text{Wi}(m - 1, I) \qquad (6)$$
$$\{m(n - 1)\, \bar{W} (\bar{W}_\infty)^{-1} \mid D_{syn}\} \sim \text{Wi}(m(n - 1), I) \qquad (7)$$
For sufficiently large $s$, $m$, and $n$, we can replace $B_\infty$ and each $W_\infty^{(i)}$ with their approximate expected values, resulting in the estimate $T = (1 + 1/m) B - (1/n) \bar{W} + \bar{U}$. Analysts can base inferences for $Q$ on the distribution
$$(Q - \bar{Q}) \sim N(0, T) \qquad (8)$$
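To make the combining rules concrete, the quantities $\bar{Q}$, $\bar{U}$, $\bar{W}$, $B$, and the variance estimate $T = (1 + 1/m)B - (1/n)\bar{W} + \bar{U}$ can be computed directly from the $mn$ released datasets. The following is a minimal sketch in Python; the function name and the array layout (point estimates stacked as an $m \times n \times k$ array) are our own choices, not part of any agency's products:

```python
import numpy as np

def combine_two_stage(Q, U):
    """Two-stage multiple-imputation combining rules.

    Q : array of shape (m, n, k), the point estimates Q^{(i,j)}
    U : array of shape (m, n, k, k), the variance estimates U^{(i,j)}
    Returns (Qbar, Ubar, B, W, T) as defined in Section 2.
    """
    m, n, k = Q.shape
    Qbar_i = Q.mean(axis=1)              # nest means, \bar{Q}^{(i)}
    Qbar = Qbar_i.mean(axis=0)           # overall mean, \bar{Q}
    Ubar = U.mean(axis=(0, 1))           # \bar{U}
    dev_w = Q - Qbar_i[:, None, :]       # within-nest deviations
    W = np.einsum('ijk,ijl->kl', dev_w, dev_w) / (m * (n - 1))   # \bar{W}
    dev_b = Qbar_i - Qbar                # between-nest deviations
    B = np.einsum('ik,il->kl', dev_b, dev_b) / (m - 1)
    T = (1 + 1/m) * B - W / n + Ubar     # variance estimate for \bar{Q}
    return Qbar, Ubar, B, W, T
```

For scalar $Q$ (i.e., $k = 1$) these quantities reduce to the combining rules of Reiter (2004).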

The fractional increase in variance due to missing data is estimated from $D_{syn}$ to be $(1 + 1/m)(B - \bar{W}/n)\bar{U}^{-1}$. The estimated fractional increase due to replacement data is $\{\bar{W}/(mn)\}\bar{U}^{-1}$. For each of these, the average fractional increase across components of $Q$ equals the average of the diagonal elements of these matrices.

3. Wald-type Tests

Using the $M$ released datasets, an analyst seeks to test the null hypothesis $Q = Q_0$, for example to test if $k$ regression coefficients equal zero. In this section, we first argue and demonstrate that the natural test based on the Wald test statistic for (8) can be poorly calibrated. We then derive an alternative test that tends to have better properties.

3.1. Poor Properties of the Test Based on the Wald Statistic

Given the normal approximation for inferences about $Q$ in (8), it may appear reasonable to use the test statistic $D = (Q_0 - \bar{Q})' T^{-1} (Q_0 - \bar{Q})$ and the p-value equal to $\text{pr}(\chi_k^2 > D)$, where $\chi_k^2$ is a chi-squared random variable with $k$ degrees of freedom. However, this test is unreliable when $k$ is large and $m$ and $n$ are moderate, as is frequently the case, because $B$ or $\bar{W}$ can have large variability. Estimating $B$ or $\bar{W}$ in such cases is akin to estimating a covariance matrix using few observations compared to the number of dimensions. We can illustrate the poor properties of tests based on $D$ with simulation studies.

For sample size $s = 1{,}000$, we simulate the complete data, $\{Y_0, Y_1, \ldots, Y_{20}\}$, from independent normal distributions with $E(Y_i) = 0$ for all $i$, $\text{var}(Y_0) = 1$, and $\text{var}(Y_i) = 2$ for $i > 0$. To simulate missing data, for computational simplicity we make 30% of the observations have $\{Y_1, \ldots, Y_{20}\}$ missing completely at random and $Y_0$ always fully observed. We obtain the set of completed datasets, $D_{com}$, by drawing values of the missing data from $f(Y_1, \ldots, Y_{20} \mid D_{obs})$, using a multivariate normal distribution with an unrestricted covariance matrix. To simulate partial synthesis, we replace all values of $Y_0$. The replacement imputations for each $D_{syn}^{(i,j)}$ are drawn independently from $f(Y_0 \mid D_{com}^{(i)})$. We vary the number of imputations according to $m \in (4, 8)$ and $n \in (2, 4, 8)$, which are in the range of values likely to be used by agencies releasing data. For example, the SIPP synthetic data use ($m = 4$, $n = 4$). We test the null hypothesis $Q = 0$, where $Q$ is the vector of coefficients for the regression of $Y_0$ on $Y_1, \ldots, Y_k$, excluding the intercept, for $k \in (5, 10, 20)$. Table 1 summarizes the simulated rejection rates of the test based on $\text{pr}(\chi_k^2 > D)$. Results are based on 10,000 runs of the simulation for each combination of $m$, $n$, and $k$, for significance levels $\alpha \in (1\%, 5\%, 10\%)$. The rejection rates for the test far exceed the nominal $\alpha$ levels, often by so much as to make the test essentially useless. This problem is alleviated only by making $m$ and $n$ excessively large. In the simulation scenario just described, setting $m$ and $n$ to 50 yielded rejection rates much closer to the desired levels; however, doing so is impractical. In addition to computational and analytic burden, releasing so many datasets can result in increased risk of disclosure.
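The data-generation side of one run of this simulation can be sketched as follows. This is a schematic only: for brevity, the stage-one and stage-two draws below are independent normal draws around observed moments, standing in for the proper posterior draws described above, and all function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_stage_datasets(Y, miss_mask, m=4, n=2):
    """Generate the M = m*n released datasets for one simulation run.

    Stage 1: fill in the entries flagged True in miss_mask, m times.
    Stage 2: within each completed dataset, replace the sensitive
    column Y_0 (column 0) n times.
    Returns a list of ((i, j), dataset) pairs, the labels indicating nests.
    """
    obs = np.where(miss_mask, np.nan, Y)      # observed part only
    col_mean = np.nanmean(obs, axis=0)
    col_sd = np.nanstd(obs, axis=0)
    released = []
    for i in range(m):
        D = Y.copy()                           # stage 1: complete the data
        rows, cols = np.where(miss_mask)
        D[rows, cols] = rng.normal(col_mean[cols], col_sd[cols])
        for j in range(n):                     # stage 2: synthesize Y_0
            S = D.copy()
            S[:, 0] = rng.normal(D[:, 0].mean(), D[:, 0].std(), size=len(D))
            released.append(((i, j), S))
    return released
```

Note that unreplaced, observed values are identical across all $mn$ datasets, exactly as in the released SIPP files.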
Another problem with the test statistic $D$ is that the variance estimate $T$ can have negative diagonal elements, which can result in negative values of $D$. This is most likely to occur for large values of $k$ when $n$ is small. This occurred in a simulation test about 20% of the time with small values of $n$, with the percentage fading to zero when $m = 4$ and $n = 8$. An ad hoc adjustment, used in Table 1, is to replace $T$ with $\bar{U} + (1 + 1/m)B$ (Reiter 2008).

Table 1. Simulated rejection rates in percentages for significance levels $\alpha$ using $\text{pr}(\chi_k^2 > D)$

                     $\alpha$ = 1%         $\alpha$ = 5%         $\alpha$ = 10%
           k =      5     10     20      5     10     20      5     10     20
m = 4:  n = 2    10.0   11.3   21.8   16.0   21.3   38.6   20.7   28.2   49.1
        n = 4    26.0   32.4   33.3   37.1   40.9   40.0   43.9   45.6   43.7
        n = 8     8.0   21.1   54.1   19.4   37.7   72.0   27.6   48.1   79.8
m = 8:  n = 2    11.8   10.7   10.2   17.2   15.1   16.8   21.0   18.7   22.8
        n = 4    11.7   36.0   39.3   22.2   48.0   45.9   30.0   55.1   49.8
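For reference, the naive test and the ad hoc variance adjustment can be written in a few lines (a sketch; the function names are ours, and scipy is assumed for the chi-squared tail probability):

```python
import numpy as np
from scipy import stats

def naive_wald_test(Qbar, T, Q0):
    """Naive test of Q = Q0 based on (8): D = (Q0 - Qbar)' T^{-1} (Q0 - Qbar),
    referred to a chi-squared distribution with k = len(Qbar) degrees of
    freedom.  As discussed above, this test can be badly miscalibrated,
    and T can fail to be positive definite."""
    diff = Q0 - Qbar
    D = diff @ np.linalg.solve(T, diff)
    return D, stats.chi2.sf(D, df=len(Qbar))

def adhoc_T(Ubar, B, m):
    """Ad hoc replacement for T (Reiter 2008): Ubar + (1 + 1/m) B."""
    return Ubar + (1 + 1/m) * B
```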

3.2. An Improved Wald-type Test

Rubin (1987), Li et al. (1991a,b), and Reiter (2005a) identified similar issues in multivariate testing in single-stage multiple imputation procedures. To mitigate the effects of variability, they reduce the number of unknown parameters in $B_\infty$ (there is no $\bar{W}_\infty$ in one-stage imputation) and derive an alternative to the natural statistic. We adapt this strategy for two-stage partially synthetic data to propose an improved Wald-like test. Our adaptation deals with both variance estimates ($B$ and $\bar{W}$) that can make $D$ unstable.

We first present the test statistic and its reference distribution for the improved Wald-like test, followed by the derivation. The statistic is
$$S = (Q_0 - \bar{Q})' \bar{U}^{-1} (Q_0 - \bar{Q}) / \{k(1 + r^{(b)} - r^{(w)})\}$$
where

$$r^{(b)} = (1 + 1/m)\, \text{tr}(B \bar{U}^{-1})/k \qquad (9)$$
$$r^{(w)} = (1/n)\, \text{tr}(\bar{W} \bar{U}^{-1})/k \qquad (10)$$
The reference distribution is approximated by an $F_{k, w_s}$ distribution, where
$$w_s = 4 + \frac{\{1 + (r^{(b)} \nu_b)/(\nu_b - 2) - (r^{(w)} \nu_w)/(\nu_w - 2)\}^2}{(r^{(b)} \nu_b)^2 / \{(\nu_b - 2)^2 (\nu_b - 4)\} + (r^{(w)} \nu_w)^2 / \{(\nu_w - 2)^2 (\nu_w - 4)\}} \qquad (11)$$
for $\nu_b > 4$ and $\nu_w > 4$, where $\nu_b = k(m - 1)$ and $\nu_w = km(n - 1)$. We provide an alternate degrees of freedom for the special case where $\nu_b \le 4$ or $\nu_w \le 4$ at the end of this section. The approximate p-value for testing $Q = Q_0$ is given by $\text{pr}(F_{k, w_s} > S)$.
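In code, the statistic $S$, the degrees of freedom $w_s$, and the approximate p-value are direct transcriptions of (9)-(11). The sketch below requires $\nu_b > 4$ and $\nu_w > 4$; the function name and argument layout are ours, and scipy supplies the $F$ tail probability:

```python
import numpy as np
from scipy import stats

def improved_wald_test(Qbar, Ubar, B, W, m, n, Q0):
    """Improved Wald-like test of Q = Q0 from Section 3.2.
    W is the within-nest variance \\bar{W}; requires nu_b > 4, nu_w > 4."""
    k = len(Qbar)
    Uinv = np.linalg.inv(Ubar)
    r_b = (1 + 1/m) * np.trace(B @ Uinv) / k         # (9)
    r_w = (1/n) * np.trace(W @ Uinv) / k             # (10)
    diff = Q0 - Qbar
    S = (diff @ Uinv @ diff) / (k * (1 + r_b - r_w))
    nu_b, nu_w = k * (m - 1), k * m * (n - 1)
    num = (1 + r_b * nu_b / (nu_b - 2) - r_w * nu_w / (nu_w - 2)) ** 2
    den = ((r_b * nu_b) ** 2 / ((nu_b - 2) ** 2 * (nu_b - 4))
           + (r_w * nu_w) ** 2 / ((nu_w - 2) ** 2 * (nu_w - 4)))
    w_s = 4 + num / den                              # (11)
    return S, w_s, stats.f.sf(S, k, w_s)             # p-value pr(F > S)
```

Unlike the naive statistic $D$, this statistic cannot be negative, since $\bar{U}$ is positive definite and $1 + r^{(b)} - r^{(w)}$ is positive in practice.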

3.2.1. Derivation of Test

Conditional on $T_\infty$ and using standard multivariate normal theory with (3), the p-value for testing $Q = Q_0$ is $\text{pr}\{\chi_k^2 > (Q_0 - \bar{Q})' T_\infty^{-1} (Q_0 - \bar{Q})\}$. Since $T_\infty$ is unknown, we average this probability over the distribution of $T_\infty$, or equivalently over the distributions of $(B_\infty \mid D_{syn}, \bar{W}_\infty)$ and $(\bar{W}_\infty \mid D_{syn})$ in (6) and (7). Averaging over the variance parameters, the p-value equals
$$\int \text{pr}\{\chi_k^2 > (Q_0 - \bar{Q})' T_\infty^{-1} (Q_0 - \bar{Q}) \mid D_{syn}, B_\infty, \bar{W}_\infty\} \times \text{pr}(B_\infty \mid D_{syn}, \bar{W}_\infty)\, \text{pr}(\bar{W}_\infty \mid D_{syn})\, dB_\infty\, d\bar{W}_\infty$$
This integral can be evaluated numerically, but it is desirable to have a simple, closed-form approximation for analysts of public use data, who may not have the skills or desire to perform the numerical integration.

For the approximation, we set $B_\infty = r_\infty^{(b)} U_\infty$ and $\bar{W}_\infty = r_\infty^{(w)} U_\infty$, where $r_\infty^{(b)}$ and $r_\infty^{(w)}$ are scalar quantities not assumed to be equal. These equations are true if (i) the fractions of missing information are equal for all components of $Q$, and (ii) the fractions of replaced information are equal for all components of $Q$. These conditions do not strictly hold in all surveys; however, in practice, rates of missing and replaced information often do not vary substantially by variable. In such cases, the stabilization in the estimate of $T_\infty$ resulting from these approximations can lead to tests with better properties than tests based on $D$. We illustrate this using simulations in Section 3.2.2.

Under these conditions, $T_\infty = U_\infty \{1 + (1 + 1/m) r_\infty^{(b)} + r_\infty^{(w)}/(mn)\}$, so that the number of parameters to be estimated for each of $B_\infty$ and $\bar{W}_\infty$ is reduced from $k(k + 1)/2$ to 1, thereby stabilizing the estimation of $T_\infty$. Assuming $\bar{U} \approx U_\infty$, the p-value becomes

$$\int \text{pr}\left\{\chi_k^2 > \frac{(Q_0 - \bar{Q})' \bar{U}^{-1} (Q_0 - \bar{Q})}{1 + (1 + 1/m) r_\infty^{(b)} + r_\infty^{(w)}/(mn)} \,\Big|\, D_{syn}, r_\infty^{(b)}, r_\infty^{(w)}\right\} \text{pr}(r_\infty^{(b)} \mid D_{syn}, r_\infty^{(w)})\, \text{pr}(r_\infty^{(w)} \mid D_{syn})\, dr_\infty^{(b)}\, dr_\infty^{(w)}$$
$$= \int \text{pr}\left\{\frac{\chi_k^2}{k}\, \frac{1 + (1 + 1/m) r_\infty^{(b)} + r_\infty^{(w)}/(mn)}{1 + r^{(b)} - r^{(w)}} > S \,\Big|\, D_{syn}, r_\infty^{(b)}, r_\infty^{(w)}\right\} \text{pr}(r_\infty^{(b)} \mid D_{syn}, r_\infty^{(w)})\, \text{pr}(r_\infty^{(w)} \mid D_{syn})\, dr_\infty^{(b)}\, dr_\infty^{(w)} \qquad (12)$$
The posterior distributions of $(r_\infty^{(b)} \mid D_{syn}, r_\infty^{(w)})$ and of $(r_\infty^{(w)} \mid D_{syn})$ can be obtained from (6) and (7). Applying standard multivariate normal theory, we have
$$\left\{\frac{k(m - 1)\, \text{tr}(B \bar{U}^{-1})/k}{r_\infty^{(b)} + r_\infty^{(w)}/n} \,\Big|\, D_{syn}, r_\infty^{(w)}\right\} \sim \chi^2_{k(m - 1)} \qquad (13)$$

$$\left\{\frac{km(n - 1)\, \text{tr}(\bar{W} \bar{U}^{-1})/k}{r_\infty^{(w)}} \,\Big|\, D_{syn}\right\} \sim \chi^2_{km(n - 1)} \qquad (14)$$

Substituting (13) and (14) into (12), after some algebra we have
$$\text{pr}\left\{\frac{\chi_k^2}{k}\, \frac{1 + \nu_b r^{(b)}/\chi^2_{\nu_b} - \nu_w r^{(w)}/\chi^2_{\nu_w}}{1 + r^{(b)} - r^{(w)}} > S\right\} \qquad (15)$$

We approximate the random variable in (15) as proportional to an $F$-distributed random variable, $F_{k, w_s}$, so that the p-value is $\text{pr}(d\, F_{k, w_s} > S)$. The approximation is obtained by matching the first two moments of $d\, F_{k, w_s}$ to those of the left-hand side of the inequality in (15). Equivalently, we approximate $1 + \chi_{\nu_b}^{-2} \nu_b r^{(b)} - \chi_{\nu_w}^{-2} \nu_w r^{(w)}$ as proportional to an inverse chi-squared random variable with degrees of freedom $w_s$, by matching the first two moments to those of $h \chi_{w_s}^{-2}$. Using iterated expectations and variances, we have

$$E(h \chi_{w_s}^{-2}) = \frac{h}{w_s - 2} \approx 1 + \frac{\nu_b r^{(b)}}{\nu_b - 2} - \frac{\nu_w r^{(w)}}{\nu_w - 2}$$
and
$$E\{(h \chi_{w_s}^{-2})^2\} = \frac{h^2}{(w_s - 2)(w_s - 4)} \approx \frac{2(\nu_b r^{(b)})^2}{(\nu_b - 2)^2 (\nu_b - 4)} + \frac{2(\nu_w r^{(w)})^2}{(\nu_w - 2)^2 (\nu_w - 4)} + \left\{E(h \chi_{w_s}^{-2})\right\}^2$$

Solving yields the expression in (11) for $w_s$ and $d = \{(w_s - 2)/w_s\}\{1 + \nu_b r^{(b)}/(\nu_b - 2) - \nu_w r^{(w)}/(\nu_w - 2)\}/(1 + r^{(b)} - r^{(w)})$. When $\nu_b$ and $\nu_w$ are sufficiently large, $d \approx 1$, and the approximate p-value is $\text{pr}(F_{k, w_s} > S)$.

It is possible that $\nu_b \le 4$ or $\nu_w \le 4$, in which case $w_s$ is not defined. This can occur for small $k$ when $m = 2$, a choice for $m$ that is not recommended due to the potentially high probability that estimated variances for scalar quantities are less than zero (Reiter 2008).

Nonetheless, if analysts find that $\nu_b \le 4$ or $\nu_w \le 4$, we suggest the alternate denominator degrees of freedom
$$w_s^* = \left\{\frac{(r^{(b)})^2}{\nu_b (1 + r^{(b)} - r^{(w)})^2} + \frac{(r^{(w)})^2}{\nu_w (1 + r^{(b)} - r^{(w)})^2}\right\}^{-1} \qquad (16)$$
This is a generalization of the degrees of freedom used in the $t$-distribution of Reiter (2004) for inferences for scalar $Q$. Details of its derivation can be found in Kinney (2007).
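The fallback degrees of freedom (16) is a one-line computation (sketch; the function name is ours):

```python
def alt_denominator_df(r_b, r_w, nu_b, nu_w):
    """Alternate denominator degrees of freedom w_s^* in (16), for use
    when nu_b <= 4 or nu_w <= 4 (e.g., small k with m = 2)."""
    c = (1.0 + r_b - r_w) ** 2
    return 1.0 / (r_b ** 2 / (nu_b * c) + r_w ** 2 / (nu_w * c))
```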

3.2.2. Illustration of Improved Performance

To illustrate the improved performance of this test, we repeat the simulations of Section 3.1. In this simulation scenario, the fractions of missing information on each component of $Q$ are equal, as are the fractions of replaced information for each component of $Q$. Table 2 displays the simulated rejection rates for the null hypothesis $Q = 0$, where $Q$ is the vector of coefficients for the regression of $Y_0$ on $Y_1, \ldots, Y_k$, excluding the intercept, for $k \in (5, 10, 20)$. The simulated levels are much closer to the $\alpha$-levels than those based on tests with $D$ (displayed in Table 1). Additionally, $S$ was observed to be positive in all 10,000 runs for each scenario.

Table 2. Simulated rejection rates in percentages for significance levels $\alpha$ using $\text{pr}(F_{k, w_s} > S)$

                     $\alpha$ = 1%         $\alpha$ = 5%         $\alpha$ = 10%
           k =      5     10     20      5     10     20      5     10     20
m = 4:  n = 2     0.1    0.3    0.6    1.8    3.0    4.2    5.1    7.3    9.2
        n = 4     0.7    1.0    1.2    3.9    5.1    5.3    8.7   10.2   10.7
        n = 8     1.0    1.2    1.0    4.7    5.0    5.4    9.4   10.3   10.5
m = 8:  n = 2     0.4    0.7    0.9    3.1    4.2    5.1    7.2    9.1   10.3
        n = 4     0.8    1.2    1.2    5.1    5.3    5.6   10.3   10.7   10.9

In some applications of partially synthetic data, including SIPP, several variables are replaced in entirety while others are left unchanged. In such cases, when $Q$ involves both replaced and unreplaced variables, the fractions of replaced information are not equal across components of $Q$, i.e., $\bar{W}_\infty \ne r_\infty^{(w)} U_\infty$. Additionally, it is commonly the case that the fraction of missing information is not equal across components of $Q$, so that $B_\infty \ne r_\infty^{(b)} U_\infty$. To gain insight on the performance of the procedures when the proportionality conditions do not hold, we can turn to the literature on significance testing in multiple imputation for missing data only. Using simulations, Li et al. (1991a) show that tests based on the condition $B_\infty = r_\infty U_\infty$ are robust in cases of practical interest when that condition does not hold. That is, tests based on the proportionality condition are better calibrated than those based on the corresponding natural Wald test. This suggests that tests based on $S$ still should outperform those based on $D$ in cases of practical interest.

We illustrate the robustness of the proposed test to violations of both proportionality assumptions by modifying the simulation scenario of Table 2. We create unequal fractions of missing information by letting $Y_0, \ldots, Y_{10}$ be completely observed and $Y_{11}, \ldots, Y_{20}$ be 30% missing. We create unequal fractions of replaced information by replacing all values in $Y_0, \ldots, Y_{10}$ with imputations in the second stage and leaving $Y_{11}, \ldots, Y_{20}$ unchanged in the second stage of imputation. We set $k = 20$ and test $Q = 0$, where $Q$ is the vector of coefficients from the regression of $Y_0$ on $Y_1, \ldots, Y_{20}$. Table 3 gives the simulated rejection rates over 1,000 iterations. These are slightly conservative but close to the desired levels. While not shown, tests based on $D$ continue to be very poorly calibrated.

3.3. Application With SIPP Public Use Data

We now illustrate the application of the Wald-like test using the partially synthetic data from the SIPP public use files. We first provide a brief overview of the SIPP, followed by the application.

The SIPP is a continuous series of national panels designed to collect data on income, labor force information, participation and eligibility for governmental assistance programs, and general demographic characteristics for individuals on public assistance. It can be used for both longitudinal and cross-sectional analyses, including assessments of the effectiveness and impacts of changes in public assistance programs, the distributions of wealth across different demographic groups, and the factors that affect changes in household and family structures (www.sipp.census.gov/sipp/analytic.html). The national panels range in size from approximately 14,000 to 36,700 interviewed households and last from two and a half to four years. Households are selected in a multistage, stratified

sampling design (www.sipp.census.gov/sipp/overview.html). Analysts can download de-identified data from the SIPP website.

In 2001, the U.S. Census Bureau, the Internal Revenue Service, and the Social Security Administration decided to supplement the information on SIPP panels from 1990–1996 with detailed earnings and Social Security benefits histories. Linking these data allows researchers to study retirement and disability programs and their interactions with other public assistance programs. Because of the highly sensitive nature of these supplemental data, the three agencies agreed to release a version of the linked data only if sensitive and identifying information was synthesized. In the end, the agencies determined that it was necessary to synthesize all but four of the over six hundred variables in the linked data. The linked data (before synthesis) also contain a large number of missing values, some due to the panel structure and others to survey or administrative database nonresponse. Therefore, the team developing the synthesis used the two-stage approach to creating synthetic data. First, they generated $m = 4$ completed datasets using a combination of sequential regression multivariate imputation (Raghunathan et al. 2001) and Bayesian bootstraps (Rubin 1981). Then, for each completed dataset, they generated $n = 4$ synthetic copies by replacing all values of the sensitive variables. The synthesis was done using sequential regression multivariate imputation and a kernel density regression technique developed by Woodcock and Benedetto (2006). These $M = 16$ datasets are released to the public and available for downloading on the SIPP website. For details of the synthesis procedure, see Abowd et al. (2006).

Abowd et al. (2006) report on several estimands and regressions of interest using SIPP data, providing univariate confidence intervals computed using the methods of Reiter (2004), and comparing with estimates from the completed data prior to synthesis. As a practical example, we illustrate a multivariate test using the regression of log total family income in 1999 against number of children, year of birth, indicators for gender, black, Hispanic, foreign born, and disabled, and categorical variables indicating education level, marital status, and type of benefits received. The multivariate test was applied to see if a 4-level categorical variable indicating industry type should be included in the regression model. In this case, the test statistic $D$ was negative, yielding a p-value of 1 when compared to a $\chi_3^2$-distribution; when adjusted as in Section 3.1, the test statistic was 203, yielding a p-value of essentially zero. On the other hand, the test statistic $S$ had a value of 32, which yielded a p-value of .0004 when compared to the $F_{3, w_s}$-distribution with $w_s = 8.89$, suggesting that the industry variable is a significant predictor of income.

Table 3. Simulated rejection rates in percentages for significance levels $\alpha$ with $k = 20$ using $\text{pr}(F_{k, w_s} > S)$ when the proportionality assumptions are not valid

                 $\alpha$ = 1%   $\alpha$ = 5%   $\alpha$ = 10%
m = 4:  n = 2        1.1            5.3            10.4
        n = 4        1.2            5.8            10.6
        n = 8        1.5            5.3            10.4
m = 8:  n = 2        1.2            5.7            10.9
        n = 4        1.3            5.5            10.5

4. Test Based on Likelihood Ratio Statistics

The Wald-like test requires access to all elements of the $U^{(i,j)}$ matrices. This may be cumbersome when the dimension of $U^{(i,j)}$ is large. We now present a test based on the set of log-likelihood ratio statistics from the completed datasets. This test is similar in spirit to those developed by Meng and Rubin (1992) and Shen (2000) for multiple imputation for missing data only, and by Reiter (2005a) for multiple imputation for synthetic data only. As before, we first present the test before outlining its derivation.

Following the notation in Schafer (1997), let $c$ be the vector of parameters in the analyst's model. Let $\hat{c}_0^{(i,j)}$ and $\hat{c}^{(i,j)}$ be the maximum likelihood estimates of

$c$ computed with $D_{syn}^{(i,j)}$ under the null and alternative hypotheses, respectively. Let
$$\bar{c}^{(i)} = \sum_{j=1}^{n} \hat{c}^{(i,j)}/n; \quad \bar{c}_0^{(i)} = \sum_{j=1}^{n} \hat{c}_0^{(i,j)}/n; \quad \bar{c} = \sum_{i=1}^{m} \bar{c}^{(i)}/m; \quad \bar{c}_0 = \sum_{i=1}^{m} \bar{c}_0^{(i)}/m$$
We write the log-likelihood ratio statistic evaluated at any two values $a$ and $b$ for any dataset $D_{syn}^{(i,j)}$ as $d^{\circ}(a, b \mid D_{syn}^{(i,j)}) = 2 \log f(D_{syn}^{(i,j)} \mid b) - 2 \log f(D_{syn}^{(i,j)} \mid a)$. The test statistic is

$$\tilde{S} = \bar{L} / \{k(1 + \tilde{r}^{(b)} - \tilde{r}^{(w)})\} \qquad (17)$$
where
$$\tilde{r}^{(b)} = \{(m + 1)(\bar{L}_m - \bar{L})\} / \{k(m - 1)\}$$
$$\tilde{r}^{(w)} = (\bar{l} - \bar{L}_m) / \{k(n - 1)\}$$
and
$$\bar{L} = \sum_{i=1}^{m} \sum_{j=1}^{n} d^{\circ}(\bar{c}_0, \bar{c} \mid D_{syn}^{(i,j)}) / (mn)$$
$$\bar{L}_m = \sum_{i=1}^{m} \sum_{j=1}^{n} d^{\circ}(\bar{c}_0^{(i)}, \bar{c}^{(i)} \mid D_{syn}^{(i,j)}) / (mn)$$
$$\bar{l} = \sum_{i=1}^{m} \sum_{j=1}^{n} d^{\circ}(\hat{c}_0^{(i,j)}, \hat{c}^{(i,j)} \mid D_{syn}^{(i,j)}) / (mn)$$

The reference distribution for $\tilde{S}$ is an $F$-distribution with $k$ degrees of freedom in the numerator and $\tilde{w}_s$ degrees of freedom in the denominator, where $\tilde{w}_s$ is the expression in (11) with the terms $r^{(b)}$ and $r^{(w)}$ replaced by $\tilde{r}^{(b)}$ and $\tilde{r}^{(w)}$. When $\nu_b \le 4$ or $\nu_w \le 4$, we use the denominator degrees of freedom in (16), substituting in $\tilde{r}^{(b)}$ and $\tilde{r}^{(w)}$ as above.

The derivation parallels the strategy of Meng and Rubin (1992), namely (i) find a statistic asymptotically equivalent to $S$ based only on the Wald statistics from each synthetic dataset; (ii) use the asymptotic equivalence of Wald and log-likelihood ratio test statistics for individual datasets to define the test statistic $\tilde{S}$; and (iii) find a reference $F$-distribution as in the Wald tests.

To begin, let $d(Q^{(i,j)}, U^{(i,j)}) = (Q^{(i,j)} - Q_0)' (U^{(i,j)})^{-1} (Q^{(i,j)} - Q_0)$ for all $(i, j)$. Because of the asymptotic equivalence of Wald and log-likelihood ratio test statistics, each $d(Q^{(i,j)}, U^{(i,j)})$ is asymptotically equivalent to its corresponding $d^{\circ}(\hat{c}_0^{(i,j)}, \hat{c}^{(i,j)} \mid D_{syn}^{(i,j)})$. Furthermore, because of the low-order variability in the $U^{(i,j)}$, we can interchange the $U^{(i,j)}$ with $U$ in any of $d(Q^{(i,j)}, U^{(i,j)})$, $d(\bar{Q}^{(i)}, U^{(i,j)})$, or $d(\bar{Q}, U^{(i,j)})$. Let $\bar{d} = \sum_{i=1}^{m} \sum_{j=1}^{n} d(Q^{(i,j)}, U^{(i,j)})/(mn)$; let $\bar{d}^{(i)} = \sum_{j=1}^{n} d(\bar{Q}^{(i)}, U^{(i,j)})/n$; and let $\hat{d} = \sum_{i=1}^{m} \sum_{j=1}^{n} d(\bar{Q}, U^{(i,j)})/(mn)$. Then $S$ is equivalent to
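A sketch of the resulting procedure: given the three averaged log-likelihood-ratio statistics $\bar{L}$, $\bar{L}_m$, and $\bar{l}$, the test requires no variance matrices at all. The function name is ours, scipy supplies the $F$ tail probability, and the fallback degrees of freedom (16) for $\nu_b \le 4$ or $\nu_w \le 4$ is omitted for brevity:

```python
from scipy import stats

def lr_test(L_bar, L_m, l_bar, k, m, n):
    """Likelihood-ratio version (17) of the multivariate test.
    L_bar, L_m, l_bar are the averaged statistics defined above."""
    r_b = (m + 1) * (L_m - L_bar) / (k * (m - 1))
    r_w = (l_bar - L_m) / (k * (n - 1))
    S_t = L_bar / (k * (1 + r_b - r_w))
    nu_b, nu_w = k * (m - 1), k * m * (n - 1)
    num = (1 + r_b * nu_b / (nu_b - 2) - r_w * nu_w / (nu_w - 2)) ** 2
    den = ((r_b * nu_b) ** 2 / ((nu_b - 2) ** 2 * (nu_b - 4))
           + (r_w * nu_w) ** 2 / ((nu_w - 2) ** 2 * (nu_w - 4)))
    w_s = 4 + num / den          # (11) with r replaced by r-tilde
    return S_t, stats.f.sf(S_t, k, w_s)
```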

$$S^* = \frac{\bar{d}/k - (n - 1) r^{(w)} - (m - 1) r^{(b)}/(m + 1)}{1 + r^{(b)} - r^{(w)}} \qquad (18)$$
where $r^{(b)}$ and $r^{(w)}$ are defined in (9) and (10). To show this, we assume without loss of generality that $Q_0 = 0$ and $U$ is a $k \times k$ identity matrix, as in Rubin (1987, p. 100). Then,

$S = \bar{Q}'\bar{Q} \big/ \left\{ k\left(1 + r^{(b)} - r^{(w)}\right) \right\}$ and, using a sums-of-squares decomposition,

$$\begin{aligned}
\bar{d} &= \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \left(Q^{(i,j)} - \bar{Q}^{(i)}\right)'\left(Q^{(i,j)} - \bar{Q}^{(i)}\right) + \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \left(\bar{Q}^{(i)} - \bar{Q}\right)'\left(\bar{Q}^{(i)} - \bar{Q}\right) + \bar{Q}'\bar{Q} \\
&= k(n-1)\, r^{(w)} + \frac{k(m-1)}{m+1}\, r^{(b)} + \bar{Q}'\bar{Q}
\end{aligned}$$

Substituting the above expression into (18) yields $S$.

Computing $r^{(b)}$ and $r^{(w)}$ requires access to $U$, which we do not want these tests to depend on. Expressions that rely only on Wald statistics are obtained by using sums-of-squares decompositions. Under the canonical conditions, and without loss of generality, for $r^{(b)}$ we have

$$\begin{aligned}
r^{(b)} &= \frac{m+1}{km(m-1)} \sum_{i=1}^m \left(\bar{Q}^{(i)} - \bar{Q}\right)'\left(\bar{Q}^{(i)} - \bar{Q}\right) \\
&= \frac{m+1}{km(m-1)} \left( \sum_{i=1}^m \bar{Q}^{(i)\prime}\bar{Q}^{(i)} - m\,\bar{Q}'\bar{Q} \right) \approx \frac{m+1}{k(m-1)} \left( \sum_{i=1}^m \bar{d}^{(i)}/m - \hat{d} \right) = r_w^{(b)}
\end{aligned}$$

since $\sum_{i=1}^m \bar{d}^{(i)}/m$ is asymptotically equivalent to $\sum_{i=1}^m \bar{Q}^{(i)\prime}\bar{Q}^{(i)}/m$, and $\hat{d}$ is asymptotically equivalent to $\bar{Q}'\bar{Q}$. For $r^{(w)}$, we have

$$\begin{aligned}
r^{(w)} &= \frac{1}{kmn(n-1)} \sum_{i=1}^m\sum_{j=1}^n \left(Q^{(i,j)} - \bar{Q}^{(i)}\right)'\left(Q^{(i,j)} - \bar{Q}^{(i)}\right) \\
&= \frac{1}{kmn(n-1)} \left( \sum_{i=1}^m\sum_{j=1}^n Q^{(i,j)\prime} Q^{(i,j)} - n \sum_{i=1}^m \bar{Q}^{(i)\prime} \bar{Q}^{(i)} \right) \approx \frac{1}{k(n-1)} \left( \bar{d} - \sum_{i=1}^m \bar{d}^{(i)}/m \right) = r_w^{(w)}
\end{aligned}$$

Using $\hat{d}$ to approximate the numerator of $S$, and $r_w^{(b)}$ and $r_w^{(w)}$ to approximate $r^{(b)}$ and $r^{(w)}$ in the denominator of $S$, we obtain the asymptotically equivalent statistic $S^*$.

We next utilize the asymptotic equivalence between the Wald statistics and the log-likelihood ratio statistics to show that $\tilde{S}$ in (17) is asymptotically equivalent to $S^*$. The equivalence of $\bar{l}$ and $\bar{d}$ follows directly from the asymptotic equivalence of the $d(Q^{(i,j)}, U^{(i,j)})$ and their corresponding $d'\left(\hat{c}_0^{(i,j)}, \hat{c}^{(i,j)} \mid D_{syn}^{(i,j)}\right)$. The equivalence of $\bar{L}$ and $\hat{d}$, and of $\bar{d}_m = \sum_{i=1}^m \bar{d}^{(i)}/m$ and $\bar{L}_m$, is more subtle. Using arguments similar to those of Meng and Rubin (1992) and Shen (2000), for quadratic complete-data log-likelihood functions, we have

$$d'\left(\bar{c}_0, \bar{c} \mid D_{syn}^{(i,j)}\right) \approx d\left(Q^{(i,j)}, U^{(i,j)}\right) - d\left(Q^{(i,j)} - \bar{Q},\; U^{(i,j)}\right)$$

$$d'\left(\bar{c}_0^{(i)}, \bar{c}^{(i)} \mid D_{syn}^{(i,j)}\right) \approx d\left(Q^{(i,j)}, U^{(i,j)}\right) - d\left(Q^{(i,j)} - \bar{Q}^{(i)},\; U^{(i,j)}\right)$$

Thus, we have

$$\begin{aligned}
\bar{L} &= \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n d'\left(\bar{c}_0, \bar{c} \mid D_{syn}^{(i,j)}\right) \\
&\approx \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \left\{ d\left(Q^{(i,j)}, U^{(i,j)}\right) - d\left(Q^{(i,j)} - \bar{Q},\; U^{(i,j)}\right) \right\} \\
&\approx \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \left\{ d\left(Q^{(i,j)}, \bar{U}\right) - d\left(Q^{(i,j)} - \bar{Q},\; \bar{U}\right) \right\} \\
&\approx d\left(\bar{Q}, \bar{U}\right) \approx \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n d\left(\bar{Q}, U^{(i,j)}\right) = \hat{d}
\end{aligned}$$

Similar reasoning shows that $\sum_{i=1}^m \bar{d}^{(i)}/m$ is asymptotically equivalent to $\bar{L}_m$. Thus, we can replace $\bar{l}$ with $\bar{d}$, $\bar{L}$ with $\hat{d}$, and $\bar{L}_m$ with $\bar{d}_m$ to obtain the test statistic $\tilde{S}$ and its reference $F$-distribution.
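The derivation above lends itself to a direct numerical check. The sketch below (Python with NumPy; the function and variable names are ours, not from the article) verifies the sums-of-squares decomposition of $\bar{d}$ under the canonical conditions ($Q_0 = 0$, $U^{(i,j)} = I$) and computes the Wald-only statistic $S^*$ from per-dataset estimates. It is an illustrative sketch of the combining rules, not the authors' software.

```python
import numpy as np

def wald_only_statistic(Q, U, Q0):
    """Illustrative sketch (not the authors' code) of the Wald-only statistic S*.

    Q  : (m, n, k) array of point estimates Q^(i,j)
    U  : (m, n, k, k) array of covariance estimates U^(i,j)
    Q0 : (k,) array, the null value
    """
    m, n, k = Q.shape
    Qbar_i = Q.mean(axis=1)            # nest averages Qbar^(i)
    Qbar = Q.mean(axis=(0, 1))         # overall average Qbar

    def d(q, u):                       # Wald distance d(q, u)
        diff = q - Q0
        return diff @ np.linalg.solve(u, diff)

    # dbar, dbar_m = sum_i dbar^(i)/m, and dhat, as defined in the text
    dbar = np.mean([d(Q[i, j], U[i, j]) for i in range(m) for j in range(n)])
    dbar_m = np.mean([d(Qbar_i[i], U[i, j]) for i in range(m) for j in range(n)])
    dhat = np.mean([d(Qbar, U[i, j]) for i in range(m) for j in range(n)])

    # Wald-only approximations r_w^(b) and r_w^(w)
    rb = (m + 1) / (k * (m - 1)) * (dbar_m - dhat)
    rw = (dbar - dbar_m) / (k * (n - 1))

    # After substituting rb and rw, the numerator of (18) collapses to dhat/k,
    # since dbar/k - (dbar - dbar_m)/k - (dbar_m - dhat)/k = dhat/k.
    return (dhat / k) / (1 + rb - rw)

# Check the sums-of-squares decomposition of dbar under canonical conditions
rng = np.random.default_rng(42)
m, n, k = 4, 3, 2
Q = rng.normal(size=(m, n, k))                  # Q0 = 0, U^(i,j) = I
Qbar_i, Qbar = Q.mean(axis=1), Q.mean(axis=(0, 1))
dbar = (Q ** 2).sum(axis=2).mean()              # average of Q^(i,j)' Q^(i,j)
within = ((Q - Qbar_i[:, None, :]) ** 2).sum(axis=2).mean()   # k(n-1) r^(w) term
between = ((Qbar_i - Qbar) ** 2).sum(axis=1).mean()           # k(m-1) r^(b)/(m+1) term
assert np.isclose(dbar, within + between + Qbar @ Qbar)
```

As a sanity check, when every $Q^{(i,j)}$ equals a common value $c$ and each $U^{(i,j)} = I$, the three averages coincide, $r_w^{(b)} = r_w^{(w)} = 0$, and the statistic reduces to $c'c/k$, matching $S$ directly.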

5. Concluding Remarks

The simulations suggest that the improved Wald-like test provides appropriate rejection rates when the null hypothesis is true. To get a sense of the power properties of these tests, we can turn to the results of Li et al. (1991b), who examined the power properties of large-sample significance tests for multiple imputation of missing data only. Those tests are derived from assumptions and approximations similar to those underlying the Wald-like test proposed here. Based on extensive simulation studies, Li et al. (1991b) report that the power curves for their tests are similar to the power curves for Wald tests based on the observed data. The greatest losses in power occur when the data deviate substantially from the proportionality assumption. The losses are largest when m is small, and mostly disappear for large m. Shen (2000) reported similar findings for nested imputation, with the greatest power loss for small m and n and for large deviations from proportionality. The tests proposed here are expected to have similar properties, though further study is needed.

Popular software packages contain routines for obtaining confidence intervals for scalar quantities and p-values for multicomponent tests from multiply-imputed datasets. These routines can be easily modified to perform the tests proposed here.

As resources available to malicious data users continue to expand, the alterations needed to protect data with traditional disclosure limitation techniques, such as swapping, adding noise, or microaggregation, may become so extreme that, for many analyses, the released data are no longer useful. Synthetic data, on the other hand, have the potential to enable data dissemination while preserving data utility. The methods in this article enable analysts of multiply-imputed, partially synthetic public-use data to obtain significance levels closer to nominal when testing multicomponent null hypotheses than was previously possible, thereby increasing the utility of synthetic data approaches.

6. References

Abowd, J.M. and Lane, J.I. (2004). New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers. In Privacy in Statistical Databases, J. Domingo-Ferrer and V. Torra (eds). New York: Springer-Verlag, 282–289.

Abowd, J.M., Stinson, M.H., and Benedetto, G.L. (2006). Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project. Technical Report, U.S. Census Bureau Longitudinal Employer-Household Dynamics Program.

Abowd, J.M. and Woodcock, S.D. (2001). Disclosure Limitation in Longitudinal Linked Data. In Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, P. Doyle, J. Lane, L. Zayatz, and J. Theeuwes (eds). Amsterdam: North-Holland, 215–277.

Abowd, J.M. and Woodcock, S.D. (2004). Multiply-imputing Confidential Characteristics and File Links in Longitudinal Linked Data. In Privacy in Statistical Databases, J. Domingo-Ferrer and V. Torra (eds). New York: Springer-Verlag, 290–297.

An, D. and Little, R.J.A. (2007). Multiple Imputation: An Alternative to Top Coding for Statistical Disclosure Control. Journal of the Royal Statistical Society, Series A, 170, 923–940.

Drechsler, J., Dundler, A., Bender, S., Rässler, S., and Zwick, T. (2007). A New Approach for Disclosure Control in the IAB Establishment Panel – Multiple Imputation for a Better Data Access. Technical Report, IAB Discussion Paper No. 11/2007.

Graham, P. and Penny, R. (2005). Multiply Imputed Synthetic Data Files. Technical Report, University of Otago, http://www.uoc.otago.ac.nz/departments/pubhealth/pgrahpub.htm

Kennickell, A.B. (1997). Multiple Imputation and Disclosure Protection: The Case of the 1995 Survey of Consumer Finances. In Record Linkage Techniques, W. Alvey and B. Jamerson (eds). Washington, D.C.: National Academy Press, 248–267.

Kinney, S.K. (2007). Model Selection and Multivariate Inference Using Data Multiply Imputed for Disclosure Limitation and Nonresponse. Ph.D. thesis, Duke University, Department of Statistical Science.

Li, K.H., Meng, X.L., Raghunathan, T.E., and Rubin, D.B. (1991a). Significance Levels from Repeated p-Values with Multiply-imputed Data. Statistica Sinica, 1, 65–92.

Li, K.H., Raghunathan, T.E., and Rubin, D.B. (1991b). Large-Sample Significance Levels from Multiply-imputed Data Using Moment-based Statistics and an F Reference Distribution. Journal of the American Statistical Association, 86, 1065–1073.

Little, R.J.A. (1993). Statistical Analysis of Masked Data. Journal of Official Statistics, 9, 407–426.

Little, R.J.A., Liu, F., and Raghunathan, T.E. (2004). Statistical Disclosure Techniques Based on Multiple Imputation. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, A. Gelman and X.L. Meng (eds). New York: John Wiley & Sons, 141–152.

Meng, X.L. and Rubin, D.B. (1992). Performing Likelihood Ratio Tests with Multiply-imputed Data Sets. Biometrika, 79, 103–111.

Mitra, R. and Reiter, J.P. (2006). Adjusting Survey Weights When Altering Identifying Design Variables via Synthetic Data. In Privacy in Statistical Databases 2006 (Lecture Notes in Computer Science), J. Domingo-Ferrer (ed.). New York: Springer-Verlag, 177–188.

Raghunathan, T.E., Lepkowski, J.M., van Hoewyk, J., and Solenberger, P. (2001). A Multivariate Technique for Multiply Imputing Missing Values Using a Series of Regression Models. Survey Methodology, 27, 85–96.

Reiter, J.P. (2003). Inference for Partially Synthetic, Public Use Microdata Sets. Survey Methodology, 29, 181–189.

Reiter, J.P. (2004). Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation. Survey Methodology, 30, 235–242.

Reiter, J.P. (2005a). Significance Tests for Multi-component Estimands from Multiply-imputed, Synthetic Microdata. Journal of Statistical Planning and Inference, 131, 365–377.

Reiter, J.P. (2005b). Using CART to Generate Partially Synthetic, Public Use Microdata. Journal of Official Statistics, 21, 441–462.

Reiter, J.P. (2008). Selecting the Number of Imputed Datasets When Using Multiple Imputation for Missing Data and Disclosure Limitation. Statistics and Probability Letters, 78, 15–20.

Reiter, J.P. and Mitra, R. (2008). Estimating Risks of Identification Disclosure in Partially Synthetic Data. Journal of Privacy and Confidentiality, 1(1), 99–110.

Reiter, J.P. and Raghunathan, T.E. (2007). The Multiple Adaptations of Multiple Imputation. Journal of the American Statistical Association, 102, 1462–1471.

Rubin, D.B. (1981). The Bayesian Bootstrap. The Annals of Statistics, 9, 130–134.

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D.B. (1993). Discussion: Statistical Disclosure Limitation. Journal of Official Statistics, 9, 462–468.

Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

Shen, Z. (2000). Nested Multiple Imputation. Ph.D. thesis, Harvard University, Department of Statistics.

Woodcock, S.D. and Benedetto, G. (2006). Distribution-preserving Statistical Disclosure Limitation. Technical Report, Department of Economics, Simon Fraser University.

Received February 2008 Revised June 2009