D. Data Swapping

Total Page:16

File Type:pdf, Size:1020Kb

D. Data Swapping

D. Data Swapping

1. An Overview of Data Swapping

Data swapping (Reiss 1984)1 is the principal method used by the Census Bureau for many of their published data products (e.g., Steel and Zayatz 1998).2 For an example of data swapping, suppose we have a table that includes several demographic variables and a geographical variable, like the one below.

Location Age Sex Candidate Pittsburgh Young M Bush Pittsburgh Young M Gore Pittsburgh Young F Gore Pittsburgh Old M Gore Pittsburgh Old M Bush Ann Arbor Young F Bush Ann Arbor Young F Gore Ann Arbor Old M Gore Ann Arbor Old F Bush Ann Arbor Old F Gore

Looking at Pittsburgh, there is a unique record for a young female, who voted for Gore. For every other Pittsburgh record, there is at least one comparable record that matches on age and sex, and furthermore, for each pair of age and sex, the candidate field is not unique. Thus knowing a subject's age and sex will not reveal candidate, except for the young female. We want to do something to prevent the disclosure of the young female's vote.

We can define the following terms:

Uniques Key: Variables that are used to identify records that pose a confidentiality risk.

Swapping Key: Variables that are used to identify records will be swapped.

Swapping Attribute: Variables over which swapping will occur.

Protected Variables: Other variables, which may or may not be sensitive.

For our example, we might choose [age, sex] as both the uniques key and the swapping key. These two keys may of course be different, but for this simple example we will take

1 Reiss, S., "Practical Data Swapping: The First Steps," ACM Transactions on Database Systems, 9, March 1984.

2 Steel, P. and L. Zayatz, "Disclosure Limitation for the 2000 Census of Housing and Population," in Statistical Data Protection: Proceedings of the Conference, Lisbon 25-27 March 1998, Eurostat, 1999. them to be the same. It is common to choose a geographical variable to be the swapping attribute, so if we make this choice, it leaves candidate as the sole protected variable.

The idea of swapping is to find a record, with a different value of the swapping attribute, that matches (on the swapping key) a record that is unique with respect to the uniques key. In our example, the young female in Pittsburgh should be matched with a young female in Ann Arbor. Two points are important here. First, having determined a unique record in Pittsburgh, we look for a match on the swapping key in a different geographical area (different value of the swapping attribute). Second, if no match exists, for any other value of the swapping key, further recoding of one or more attributes will be required to ensure a match.

We do indeed find two matches in Ann Arbor, so we can swap the Pittsburgh record with one of them. In general, one typically decides to perform a swap with some probability p, whose value is held confidential. If we go ahead and perform the swap, anyone looking at the table will still find a unique young female in Pittsburgh, but it is not at all certain that she actually voted for the candidate listed. Naturally, just as for the Doomsday Machine in "Dr. Strangelove", it is important that it be well advertised that swapping is a possibility! Advertising this is intended to (1) dissuade the data snooper from making a linkage, and (2) lower the public credibility in any claim by the data snooper of a valid identification.

Switching attention to Ann Arbor, we see that there is a single old male, so something needs to be done to protect this record. Just as before, we look for a match in a different geographical area, and find two from which to choose in Pittsburgh. Flipping a coin that comes up heads with probability p, we swap with one of the Pittsburgh records if it comes up heads.

In circumstances involving extremely sensitive data, or where respondents themselves might seek to compromise confidentiality, one might consider protecting not just unique records. If a cell in a cross-tabulation has a value of 1 (i.e., a unique record) or 2, then one respondent might look for the only other person who matches her on the identifying attributes, and so be able to infer sensitive details of that person. In general, the group releasing the data is free to choose any number n and protect cells whose count is between 0 and n.

In a similar vein, it is common to adjust the swapping probability p according to the relative size of the group sharing the value of the swapping attribute. In our example, if there were 10 data subjects in Ann Arbor, and 20 in Pittsburgh, the value of p would be larger for swaps of Ann Arbor uniques, since there is presumably greater disclosure danger associated with smaller populations. To modify p, choose your favorite monotone function.

A natural question at this point concerns the statistical downside of this procedure. It is clear that any analysis involving just the attributes in the swapping key will be completely unaffected, since these values do not change at all. But it is true that the protected variables may change in value. In looking at the example, it wouldn't make sense to, for instance, always try to choose a swapping pair that agreed on the value(s) of the protected variable(s). This defeats the purpose of the swapping. One could perhaps choose, if a choice existed, on the basis of the relative frequencies of the protected values in the two geographical locations. If these distributions are similar, then if we simply choose the record in the second area at random, we do not skew the distribution in either area very much (the means remain the same, although variance may be increased). If the two areas have markedly different distributions for the protected variables, it is not clear how to choose the swapping record without skewing one or the other of the distributions.

One might also try to find a geographical area that most nearly matches on the distribution of the protected variables in hopes of minimizing any distortion. Typically, the swapping attribute is chosen to be a hierarchical variable and swaps searched for "nearby" in the hierarchy. To go back to our example, losing the information that an old man's vote for Bush occurred in Pittsburgh as opposed to Ann Arbor will not seriously bias publishable results since the swap will only be made in a very minor number of cases. By carefully selecting the geographical units across which the swaps occur (e.g., units with similar distributions of the protected variables), the impact to analysts of local and regional data will be minimized.

Disclosure risk under data swapping can be analyzed (Duncan, Roehrig and Kannan 2000).3 Speaking generally, this risk decreases with increasing p, so if you can quantify your risk preferences, the value of p can be set to just exceed your threshold. It is also possible to analyze the tradeoff between data utility and disclosure risk, but again it is often difficult to quantify data utility. Which researchers and which studies are the most important? How does one tradeoff the inevitable inaccuracies of individual attribute values? There do not seem to be any specific instances where this has been done successfully.

There are, however, a number of reasons to choose a swapping approach over suppression or "blank and impute" schemes. First, suppression simply loses data, and the statistical effects can be large. A data user interested in estimating some quantity that is a function of suppressed values must make an assumption about those missing values. Two common assumptions are (1) the values are zero and (2) the values are estimated using the marginal values (where they are provided). In the first case, there is an obvious bias downward. In the second case, there may also be a large bias if the suppressed data are "special" in some way. A study by Census personnel (Navarro, Flores-Baez and Thompson 1988)4 compared swapping with suppression under both of these assumptions and showed that swapping typically resulted in less than one fourth the distortion (as measured by dissimilarity from the true data using the D-statistic) of suppression.

3 Duncan, G., S. Roehrig and K. Kannan, "Final Report on the American FactFinder Disclosure Audit Project for the U. S. Census Bureau," Heinz School Technical Report, Carnegie Mellon University, 2000.

4 Navarro, A., L. Flores-Baez and J. Thompson, "Results of Data Switching Simulation," paper delivered to the American Statistical Association and Population Statistics Advisory Committees, 1988. Blanking and imputation is another technique that might be used with the TEDS data. Here cells are selected at random and "blanked out," then filled again with imputed values. Depending on the imputation scheme, this is similar to random cell suppression, except it is the data provider rather than the user that determines the appropriate estimates for the "suppressed" cells. While the Navarro et al. study mentioned above did not look directly at blank and impute, it is true that imputation is most typically done using estimates (averages) from the complete data, and so could avoid biases arising from the use of partial (i.e., suppressed) tables. However, distortions to the data—the inevitable movement of the reported data to average values—makes blank and impute less desirable than swapping.

Although the swapping methodology is in some ways analogous to the "hot deck method" for the imputation of missing data, swapping always preserves the observed marginal frequencies for individual variables (e.g. pregnant women) within the limits imposed by the swapping rules. It does not, however, preserve detailed associations in the observed data (e.g. that an old man in Ann Arbor voted for Gore and an old man in Pittsburgh voted for Bush). What we retain is that among old men in Ann Arbor and Pittsburgh, one voted for Gore and one for Bush. Since the uniqueness of multivariate characteristics (here possibly the geographical specificity for the old men's voting behavior) that will be distorted by the swapping at some rate p involves rare combinations of characteristics, the level of multivariate distortion involved will not harm true analysts in their use of the data.

Even an analyst who chooses to focus on voting in Ann Arbor will need to aggregate results across ages and genders or other subclasses to obtain enough cases to achieve reasonable precision for their results. In a practical sense, the information that one or two old men in Ann Arbor voted a particular way is only of interest to the data snooper and not the true analyst. In addition to the small numbers of cases where multivariate relationships may be changed, the possibility that the swap will be "like for like" attenuates the bias that can result.

Recommended publications