
The Boundary Between Privacy and Utility in Data Publishing∗

Vibhor Rastogi Dan Suciu Sungho Hong

ABSTRACT

We consider the privacy problem in data publishing: given a database instance containing sensitive information, "anonymize" it to obtain a view such that, on one hand, attackers cannot learn any sensitive information from the view, and on the other hand, legitimate users can use it to compute useful statistics. These are conflicting goals. In this paper we prove an almost crisp separation of the case when a useful anonymization algorithm is possible from when it is not, based on the attacker's prior knowledge. Our definition of privacy is derived from existing literature and relates the attacker's prior belief for a given tuple t with the posterior belief for the same tuple. Our definition of utility is based on the error bound on the estimates of counting queries. The main result has two parts. First we show that if the prior beliefs for some tuples are large then there exists no useful anonymization algorithm. Second, we show that when the prior is bounded for all tuples then there exists an anonymization algorithm that is both private and useful. The anonymization algorithm that forms our positive result is novel, and improves the privacy/utility tradeoff of previously known algorithms with privacy/utility guarantees such as FRAPP.

∗This work was partially supported by NSF Grants IIS-0415193, IIS-0627585, and ITR IIS-0428168.

(a) Test Scores
Age  Nationality  Score
25   British      99
27   British      97
21   Indian       82
32   Indian       90
33   American     94
36   American     94

(b) 2-diversity and 3-anonymity [13]
Age    Nationality  Score
21-30  *            99
21-30  *            97
21-30  *            82
31-40  *            90
31-40  *            94
31-40  *            94

(c) FRAPP [3]
Age  Nationality  Score
25   British      99
28   Indian       99
29   American     81
32   Indian       90
39   American     84
32   Indian       89

(d) αβ algorithm
Age  Nationality  Score
25   British      99
21   British      99
22   Indian       89
32   Indian       90
28   Indian       99
29   American     81
33   American     94
27   American     94
32   British      83
36   American     94
26   American     99
39   Indian       94

Table 1: Different data publishing methods

1. INTRODUCTION

The need to preserve private information while publishing data for statistical processing is a widespread problem. By studying medical data, consumer data, or insurance data, analysts can often derive very valuable statistical facts, sometimes benefiting the society at large, but concerns about individual privacy prevent the dissemination of such databases. Clearly, any anonymization method needs to trade off between privacy and utility: removing all items from the database achieves perfect privacy, but total uselessness, while publishing the entire data unaltered is at the other extreme. In this paper we study the tradeoff between the privacy and the utility of any anonymization method, as a function of the attacker's background knowledge. When the attacker has too much knowledge, we show that no anonymization method can achieve both. We therefore propose to study anonymization methods that can protect against bounded adversaries, which have some, but only limited, knowledge. We show that in this case the tradeoff can be achieved.

1.1 Basic Concepts

Throughout the paper we will denote with I the database instance containing the private data. I is a collection of records, and we denote n = |I| its size. We also denote D the domain of all possible tuples, i.e. I ⊆ D, and denote m = |D|. For example, the data instance I in Table 1(a) consists of n = 6 records representing test scores, and D is the domain of all triples (x, y, z) where x ∈ [16, 35] is an age, y is a nationality (say, from a list of 50 valid nationalities), and z ∈ [1, 100] is a test score, thus m = 20 · 50 · 100 = 100,000. The goal of an anonymization algorithm is to compute a new data instance V from I that offers both privacy (hides the tuples in I) and utility.
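To make the notation concrete, here is a minimal Python sketch of the running example (our own illustration, not part of the paper; the placeholder nationality names and exact domain bounds are immaterial): the instance I of Table 1(a) as a set of tuples, the product domain D, and the sizes n and m used throughout.

```python
from itertools import product

# Attribute domains of the running example (Sec. 1.1): 20 ages, a list of 50
# nationalities (only three are named in Table 1), and scores in [1, 100].
AGES = list(range(16, 36))
NATIONALITIES = ["British", "Indian", "American"] + [f"nat{i}" for i in range(47)]
SCORES = list(range(1, 101))

# The private instance I of Table 1(a).
I = {(25, "British", 99), (27, "British", 97), (21, "Indian", 82),
     (32, "Indian", 90), (33, "American", 94), (36, "American", 94)}

n = len(I)                                          # database size, n = 6
m = len(AGES) * len(NATIONALITIES) * len(SCORES)    # domain size, m = 100000

# The full domain D = D1 x D2 x D3 is only ever enumerated lazily.
D = product(AGES, NATIONALITIES, SCORES)
print(n, m)                                         # 6 100000
```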

Table 2: Symbols used in the paper; formal definitions in Sec. 2.

Symbol  Definition               Meaning
D       Domain of tuples
I       Database instance        I ⊆ D
V       Published view           V ⊆ D
n       Database size            n = |I|
m       Domain size              m = |D|, n ≪ m
k       Attacker's prior         prior of tuple t, Pr[t] ≤ k·n/m
γ       Attacker's posterior     posterior Pr[t|V] ≤ γ
ρ       Query estimation error   |Q(I) − Q̃(V)| ≤ ρ√n

Figure 1: Data anonymization: (1) Local perturbation (2) Data publishing (3) Output perturbation.

The notion of privacy that we use in this paper compares the attacker's prior probability of a tuple belonging to I, Pr[t] = Pr[t ∈ I], with the a posteriori probability, Pr[t|V] = Pr[t ∈ I|V]. If the algorithm is private, then in particular: if Pr[t] is low, Pr[t|V] must also be low. It makes sense to express the attacker's prior as Pr[t] = k·n/m, for some k, since the database contains n tuples out of m. We also denote γ a bound on the a posteriori. The notion of utility that we use in this paper measures how well a user can estimate counting queries. An example of a counting query Q for the data in Table 1(a) is: count the number of people from poor African countries that received a score > 98. Assuming the user has a list of poor African countries, the answer of Q on I, denoted Q(I), can simply be obtained by scanning the table I and counting how many records satisfy the predicate. But the user doesn't have access to I, only to V, hence she will compute instead an estimate, Q̃(V), over V. Our definition of utility consists in a guaranteed upper bound on the absolute error of the estimator; more precisely, it is given by a parameter ρ s.t. |Q(I) − Q̃(V)| ≤ ρ√n. (We explain the particular choice of ρ√n in a moment.) These parameters are summarized in Table 2, and defined formally in Sec. 2.

1.2 Our Main Results

In this paper we prove two complementary main results. First, when k = Ω(√m) then no algorithm can achieve both utility and privacy (Theorem 3.3). In other words, if the adversary is powerful, i.e. his prior is Pr[t] = k·n/m = Ω(n/√m) for some tuples t, then no algorithm can achieve both privacy and utility. This justifies us to consider adversaries that are bounded in power, i.e. their prior is bounded. Second, when k = O(1), equivalently when the adversary's prior is bounded by Pr[t] ≤ k·n/m = O(n/m), then we describe an algorithm that achieves a utility of ρ = √(k/γ) (Theorems 4.3 & 4.4). Here, and throughout the paper, the notations O(−) or Ω(−) refer to asymptotic functions in n and m.

1.3 Related Work

We discuss now the related work in order to place our results in context. Privacy mechanisms can be classified into three categories, according to where they are deployed during the life cycle of the data, from the time it is collected from individuals to the point it is made available for users. See Figure 1. In local perturbation, (1), the individuals trust no one but themselves, and they anonymize their respective data before transferring it to the central server. In data publishing, (2), anonymization occurs at the central server. Now the individuals are required to trust this server, but the advantage is that one may design better anonymization algorithms. All results in this paper are for data publishing. In output perturbation, (3), the data is kept on a trusted central server that accepts user queries, evaluates them locally, then returns a perturbed answer. The best privacy/utility tradeoffs can be achieved in this setting, but the cost is that the user can never see the data directly, and can access it only through the interaction with the server, which often is limited in what kinds of queries it supports. Some output perturbation algorithms place a bound on the number of queries, i.e. the server counts how many queries it answers on behalf of a user and stops once a limit has been reached. Referring to Fig. 1 we note that all "positive" results extend automatically to the right (e.g. any local perturbation algorithm can also be used in data publishing), while all "negative" results extend to the left.

The relevant results from the related work in the literature are captured in Table 3, where we attempted to present them in a unified framework. We explain now the positive and negative results.

Positive results. There are several randomized algorithms for local perturbation [1, 9, 10, 18, 2, 15, 3]. Each tuple in the database I is locally perturbed by replacing it with a randomly chosen tuple according to a predefined probability distribution. FRAPP [3] is a framework that generalizes all algorithms in this class, and also achieves the (provably) best privacy/utility in this class: ρ = k/γ. Note that our algorithm improves this, by reducing ρ to ρ = √(k/γ) (note that k/γ > 1 because k > 1 and γ < 1). The reason we could improve the utility is that we use the power of the trusted server to make better perturbation in the data. In data publishing, virtually all techniques described in the literature are variations of the k-anonymity method (e.g. [19, 13, 17, 20]), which do not guarantee the privacy and/or utility, and, hence, do not relate directly to our results. In output perturbation, a much better tradeoff of ρ = log(k/γ)/√n can be achieved [8], if the number of queries can be bounded to O(n). In contrast, our algorithm guarantees utility for all queries.

Negative results. The first impossibility result was shown in [6] and says that any algorithm which guarantees ρ = o(1) for more than Ω(n log² n) queries will be completely non-private. Their result is stated to say that no algorithm can achieve absolute errors |Q(I) − Q̃(V)| ≤ o(√n), which corresponds to ρ = o(1).

Table 3: Classification of privacy preserving techniques; w.h.p. = with high probability; o(·) is used w.r.t. n.

Positive results:
- Local perturbation (1): FRAPP [3]: w.h.p. ρ = k/γ; [1], [9], [10], MASK [18], OLAP [2], Sketches [15]: weaker utility.
- Data publishing (2): This paper: w.h.p. ρ = √(k/γ) (Theorems 4.3 & 4.4); k-anonymity [19], l-diversity [13], Swapping [17], Anatomy [20]: no formal guarantees.
- Output perturbation (3), bounded queries: SuLQ [5], Differential Privacy [7] [8]: w.h.p. ρ = log(k/γ)/√n, for O(n) queries.

Negative results (utility):
- Local perturbation (1) and Data publishing (2): This paper: if ρ = o(1) with probability > 1/2, then no privacy (Theorem 2.6).
- Output perturbation (3), unbounded queries: if ρ = o(√n) for all queries, then no privacy [6].
- Output perturbation (3), bounded queries: if ρ = o(1) for Ω(n log² n) queries, then no privacy [6].

Negative results (privacy):
- This paper: utility impossible for k = Ω(√m) (Theorem 3.3).

The error guarantees in our algorithm are indeed larger, namely ρ = O(1), but on the other hand we guarantee this for all queries. The second result in [6] is that no privacy can be guaranteed if all queries can be estimated with ρ = o(√n). This seems to contradict our result, which states ρ = O(1) = o(√n). The distinction lies in the fact that our error guarantee is only with high probability (denoted w.h.p. in Table 3), while the negative result in [6] makes the assumption that all queries are guaranteed to have that error bound. We explain now what we mean by "with high probability" (which is the same as in FRAPP). Our anonymization algorithm is a randomized algorithm s.t. for any query Q, the probability that the algorithm outputs a view V s.t. |Q(I) − Q̃(V)| ≤ ρ√n is 1 − ε, where ε is a parameter of the algorithm. Now suppose we run the attack in [6] on our algorithm. In that attack the adversary evaluates a large number of counting queries, say N, and must obtain good estimates on all queries. Now, in order to attack our algorithm, all these queries are estimated over the same view V, so the probability (over V) that all estimates are good is (1 − ε)^N. When N is too large, the attack cannot continue: this is why the result in [6] does not contradict ours. On the other hand, if N = Ω(n log² n), then we show (in Theorem 2.6) that the attack succeeds, hence no privacy/utility can be obtained in data publishing when ρ = o(1) (see Table 3), even when ε = 1/2.

The negative results in [6] were obtained without giving any definition of an adversary and of privacy. They state that if one wants to have a lot of utility, then there is a major privacy breach that would violate whatever definition of privacy one adopts. Thus, our main impossibility result (Theorem 3.3) is incomparable to [6], and in some sense much more subtle. It states that if the adversary is more powerful than a certain bound, then we cannot achieve both privacy and utility. Our result is the first negative result trading off an adversary's power for utility. The techniques that we use to derive our Theorem 3.3 improve over an impossibility result in [8] for a different notion of privacy, called ε-indistinguishability, which is stated for a different setting (without an adversary).

1.4 Example

Our anonymization algorithm is closest in spirit to, and improves over, FRAPP, and we illustrate here the two algorithms using the data in Table 1(a) (a database of English test scores).

FRAPP anonymizes data as follows. Independently for each tuple in the database I the algorithm randomly chooses whether to keep it in V, or whether to replace it with another tuple, randomly chosen from the domain D. The proportion of retained tuples is governed by a parameter γ_frapp. For example, if we choose the parameter γ_frapp = 1/3, the a posteriori probability for the adversary with no prior knowledge remains less than 1/3. Table 1(c) illustrates a possible output of FRAPP. Two tuples from the original instance have been retained while four have been replaced by uniformly chosen random tuples. For illustration purposes the tuples that have been retained are highlighted.

Our algorithm, called αβ, operates in two steps. First, for each tuple in the instance I it randomly chooses whether to keep it in V or whether to remove it. Second, for each tuple in the domain D it randomly chooses to insert it in V with a probability β. A particular output of the algorithm is shown in Table 1(d): four tuples have been retained from the original data, and eight new tuples have been inserted. We highlight the four tuples that come from I. We choose the parameters α and β so that the a posteriori is the same 1/3 as that of Table 1(c): this can be seen by noting that Table 1(d) contains the same fraction 1/3 of true tuples. Thus, FRAPP and αβ have the same privacy in this example. However, αβ has a better utility, intuitively because it contains four of the original tuples, while FRAPP only retained two. We will prove this formally in Section 4.

1.5 Specific Contributions

In this paper we make the following specific contributions:

• We define a bounded adversary, a notion of privacy that protects against a bounded adversary, and a notion of utility that gives guarantees on the estimation of arbitrary counting queries. (Sec. 2)

• We prove that for powerful adversaries no algorithm can achieve both privacy and utility. (Sec. 3)

• We describe the αβ anonymization algorithm that guarantees privacy and utility for bounded adversaries, and improves the utility of FRAPP. We describe an estimation algorithm for counting queries with arbitrarily complex predicates. (Sec. 4)

• We present two extensions of the αβ algorithm. First, we show how one can publish multiple views over the same data, for example when the data has been updated. Then we discuss how one can further improve the privacy if one has some information about the adversary's prior. (Sec. 4.4)

• We validate experimentally the utility of our algorithm on a real dataset. (Sec. 5)

• We present an extension of our theoretical results to priors with correlations between tuples. (Sec. 6)

2. NOTATIONS AND DEFINITIONS

A database instance is a table consisting of n tuples, I = {t1, t2, .., tn}, where each tuple has l attributes A1,...,Al, and each attribute Ai takes values from a finite domain Di. We denote D = D1 × ... × Dl the domain of all tuples with attributes A1,...,Al; thus, I ⊆ D, and we write m = |D|. Our goal is to design an algorithm that computes a view V ⊆ D from I s.t. (1) an adversary cannot determine, by using V, whether any particular tuple t belongs to I, and (2) any user can use V to compute approximate values to counting queries over I. The first property is privacy, the second is utility.

Modeling the Adversary. We model the adversary's background knowledge using a probability distribution Pr1 over the set of all possible database instances I. Formally, Pr1 : 2^D → [0, 1] is such that Σ_{I ⊆ D} Pr1[I] = 1. For each tuple t ∈ D we denote Pr1[t] = Pr1[t ∈ I] the marginal distribution (i.e. Pr1[t] = Σ_{I : t ∈ I} Pr1[I]), and call Pr1 tuple-independent when the tuples are independent of each other. Clearly, we do not know Pr1, only the attacker knows it, and we should design our privacy algorithm assuming the worst about Pr1. As we shall see, however, it is impossible to achieve privacy and utility when the attacker is all powerful, hence we will make reasonable assumptions about Pr1 below.

Modeling the Algorithm. A privacy-preserving algorithm is a randomized algorithm A that, given I, computes a view V. We denote Pr2^I[V] the probability distribution on the algorithm's outputs, i.e. Pr2^I : 2^D → [0, 1] is s.t. Σ_{V ⊆ D} Pr2^I[V] = 1. Unlike Pr1, we have total control over Pr2, since we design the algorithm A. As the algorithm A needs to be public, an adversary can compute the induced probability Pr12[t|V] = Pr12[t ∈ I|V] based on his prior distribution Pr1 and the algorithm's distribution Pr2, namely Pr12[t|V] = (Σ_{I ⊆ D : t ∈ I} Pr2^I[V]·Pr1[I]) / (Σ_{I ⊆ D} Pr2^I[V]·Pr1[I]). We call Pr1[t] the prior probability of the tuple t and Pr12[t|V] the posterior probability of the tuple t conditioned on the view V. Throughout this paper we assume that the adversary is computationally powerful and that Pr12[t|V] can always be computed by the adversary irrespective of the computation effort required.

2.1 Classification of Adversaries

We will prove in Sec. 3 that it is impossible to achieve both privacy and utility if the adversary is too powerful, hence we propose here to study adversaries with restricted power, based on placing a bound on his prior knowledge. We will show that for certain values of this bound privacy and utility are possible. Specifically, we restrict the adversary by assuming that for every tuple t his prior probability Pr1[t] is either small, or equal to 1. In the first case we will seek to hide the tuple from the adversary; in the second case there's nothing to do, since the adversary already knows t. Another way to look at this is that we require the algorithm to preserve the privacy for tuples for which Pr1[t] is small: if Pr1[t] is large, we may well assume that it is 1, in essence giving up any attempt to hide t from the adversary.

Definition 2.1. Let d ∈ (0, 1). A d-bounded adversary is one for which ∀t ∈ D, either Pr1[t] ≤ d or Pr1[t] = 1.

We also consider a sub-class of d-bounded adversaries, called d-independent adversaries, which further requires Pr1 to be tuple-independent:

Definition 2.2. A d-independent adversary is a d-bounded, tuple-independent adversary.

Thus, for a d-independent adversary there can be no correlations among tuples. However, within a tuple there may be correlations among the attributes. For example, consider a table with the schema (age, nationality, disease). A d-independent adversary who knows the correlation "Japanese have a low probability of having heart attack" corresponds to the distribution in which every tuple having the nationality Japanese and the disease heart attack has a low prior probability.

2.2 Privacy definition and motivation

Our definition of privacy is based on comparing the prior probability Pr1[t] with the a posteriori probability Pr12[t|V], and is derived from existing definitions in the literature.

Definition 2.3. An algorithm is called (d, γ)-private if the following holds for all d-independent adversaries Pr1, views V, and tuples t s.t. Pr1[t] ≤ d:

    d/γ ≤ Pr12[t|V] / Pr1[t]    and    Pr12[t|V] ≤ γ

If tuple t fails the left (or right) inequality then we say there has been a negative (or positive) leakage, respectively.

Our definition of (d, γ)-privacy is an adaptation of the notion of a (ρ1, ρ2) breach defined in [9]. A positive (ρ1, ρ2) breach, for ρ1 < ρ2, is defined when the prior is Pr[t] ≤ ρ1 and the posteriori is Pr[t|V] > ρ2. Thus a positive leakage (right inequality) in (d, γ)-privacy is precisely a (d, γ) breach (ρ1 = d, ρ2 = γ). On the other hand, a negative (ρ1, ρ2) breach in [9] does not correspond to our setting, because it is meant to capture large values of the prior, while in our case the prior is ≤ d. Instead, our definition for negative leakage is relative, i.e. the a posteriori cannot drop by more than a factor of ρ1/ρ2 = d/γ. The privacy definition assumed in [9, 13, 11] is based on variations of the (ρ1, ρ2)-breach (called the Uninformative principle in [13]). In [8, 3] privacy is based on a different definition, called ε-indistinguishability in [8] and γ-amplification in [3]: we show in our technical report [16] how those relate to (d, γ)-privacy.
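As a small illustration of Definition 2.3, the following Python check (our own sketch, not part of the paper) classifies a (prior, posterior) pair for a tuple as a positive leakage, a negative leakage, or neither, for given d and γ.

```python
def leakage(prior, posterior, d, gamma):
    """Classify a tuple's leakage under (d, gamma)-privacy (Definition 2.3).

    Only tuples with prior <= d are constrained by the definition.
    """
    if prior > d:
        return None                    # definition places no constraint on this tuple
    if posterior > gamma:
        return "positive leakage"      # right inequality fails: Pr12[t|V] > gamma
    if posterior < (d / gamma) * prior:
        return "negative leakage"      # left inequality fails: posterior dropped too far
    return None                        # both inequalities hold


# Example with d = 0.001 and gamma = 0.2: a tuple whose posterior jumps to 0.5
# is a positive leakage; one whose posterior collapses to 1e-6 is a negative leakage.
print(leakage(0.0005, 0.5, 0.001, 0.2))    # positive leakage
print(leakage(0.0005, 1e-6, 0.001, 0.2))   # negative leakage
print(leakage(0.0005, 0.001, 0.001, 0.2))  # None
```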

2.3 Utility

Our definition of utility is based on how well one can estimate counting queries, i.e. queries of the form "count the number of records satisfying a certain predicate". Formally:

Definition 2.4. A counting query is a query of the form:

    select count(*) from I where C

where C is any condition on the attributes A1, A2, ..., Al of the database I. We denote with Q(I) the result of evaluating Q on an instance I; thus, Q(I) is a number.

We illustrate with two examples:

    Q1 = select count(*)
         from I
         where (age > 16 and age < 17 or age > 55)
               and score > 91

    Q2 = select count(*)
         from I
         where score > 98 and nationality in
               (select country from Regions
                where continent = 'Africa'
                and GDP < 10000)

The first query counts the number of either very young or very old people that scored > 91. The second query counts the number of testers with score over 98 and coming from a poor African country. This query uses an auxiliary table Regions to check the continent of a country and its GDP to determine if that country is poor. In general the where condition in a query Q may use arbitrarily complex predicates over one, or several, attributes.

Our notion of utility is based on how well the users can approximate counting queries. Users do not have access to the database I, only to a view V. The anonymization algorithm must specify how users can estimate query answers by rewriting a query Q expressed over I into an estimator Q̃ expressed over V, s.t. Q̃(V) approximates Q(I). The estimator will examine the tuples in V, apply any necessary complex predicates, and perform other computations. The only information that it cannot access is the secret database, I. Our formal notion of utility requires that the absolute error |Q(I) − Q̃(V)| be small, for all queries Q.

Definition 2.5. A randomized algorithm A is called (ρ, ε)-useful if it has an estimator s.t. for any query Q and instance I:

    Pr2[ |Q(I) − Q̃(V)| ≥ ρ√n ] ≤ ε

Thus ρ√n represents a bound on the absolute error, guaranteed with a probability of at least 1 − ε.

2.4 Discussion

Choice of privacy parameters. An intuitive reference point for the bound d of the adversary is n/m, since the database has n tuples from a domain of size m, so we should expect that the adversary's prior is Pr1[t] ≥ n/m at least for some tuples t; in [2] privacy was also measured with respect to a parameter 's' equal to n/m. More precisely, the bound on the adversary will be d = kn/m, for some k > 1. The a posteriori, γ, is larger, and measured in absolute values. For example, γ = 0.05 means that the view V leaks no tuple t with more than 5% probability.

Choice of utility parameters. We discuss now our particular choice for the definition of utility (Def. 2.5). A first observation is that if the algorithm could guarantee |Q(I) − Q̃(V)| = O(1) then privacy is compromised: the attacker simply counts the number of tuples in the database that are equal to a given tuple, say Age = 28, Nationality = Indian, Score = 99. The real answer can be only 0 or 1, depending on whether the tuple is present in I or not; hence, if the estimator is too accurate then there is a privacy leakage. Thus, the absolute error cannot be O(1). Dinur and Nissim [6] showed that, in a different setting from ours, if queries can be estimated with an absolute error of o(√n), then there would be massive privacy breaches. That result requires the estimator to guarantee the error deterministically. Nevertheless, we can adapt their results to our setting and prove the following in our technical report [16]:

Theorem 2.6. Let A be any randomized algorithm. If there exists ρ = o(1) s.t. for all I and for all queries Q we have Pr2(|Q̃(V) − Q(I)| ≤ ρ√n) > 1/2, then there exists a tuple t such that Pr1[t] is small (≤ n/m) but Pr12[t|V] is large (≥ 2/3).

This explains that the absolute error cannot be o(√n) either, and justifies our choice of expressing the absolute error as ρ√n. In practice, one wants ρ to be some small constant.

k-Anonymity. We argued in the introduction that the k-anonymity algorithms do not satisfy our formal notions of privacy and utility; we justify this here briefly, referring to the k-anonymity and l-diversity variant introduced in [13]. Considering a d-independent adversary, privacy is compromised when the adversary knows that some tuples have very low probability. To see the intuition, consider the instance in Table 1(b). Suppose a d-independent adversary is trying to find out Joe's test score, and he knows that Joe is likely to have a low score: i.e. Pr1[t] is very low for a tuple saying that Joe's score is greater than 95, and larger (but still ≤ d) for tuples saying that his score is less than 90. If the adversary knows that Joe's age is less than 30, then his record is among the first three: since the first two tuples have scores 99 and 97, they have very low probability. The adversary, thus, concludes that Joe's score is 82 with very high probability. The notion of (ρ, ε)-usefulness is also not satisfied. Suppose we want to estimate the number of students between 29 and 31 years old from Table 1(b): the answer can be anywhere between 0 and 6, and, if our estimate is the average, 3, then the only way we can guarantee any accuracy is by making assumptions on the distribution of the data.
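Since Definitions 2.4 and 2.5 treat a query as nothing more than an arbitrary predicate applied to a bag of tuples, a counting query and its evaluation fit in a few lines. The sketch below is ours (the predicate is only in the spirit of Q1 above, and the attribute layout follows the running test-scores example).

```python
from typing import Callable, Iterable, Tuple

Record = Tuple[int, str, int]   # (age, nationality, score)

def counting_query(predicate: Callable[[Record], bool]) -> Callable[[Iterable[Record]], int]:
    """Build Q from a 'where' predicate: Q(table) counts the records satisfying it."""
    def Q(table: Iterable[Record]) -> int:
        return sum(1 for t in table if predicate(t))
    return Q

# A query in the spirit of Q1: very young or very old testers with score > 91.
Q1 = counting_query(lambda t: (t[0] < 17 or t[0] > 55) and t[2] > 91)

I = [(25, "British", 99), (27, "British", 97), (21, "Indian", 82),
     (32, "Indian", 90), (33, "American", 94), (36, "American", 94)]

print(Q1(I))   # 0 -- no very young or very old testers in Table 1(a)
```

The same callable Q can later be applied to a published view V; the estimator of Section 4 only rescales the resulting count.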

3. IMPOSSIBILITY RESULT

Our main impossibility result says that if the attacker is powerful (i.e. his prior is large, d = Ω(n/√m)), then no algorithm can achieve both privacy and utility. This holds even if the attacker's prior is tuple-independent. For the impossibility result, we first show that if d = Ω(n/√m) then no (d, γ)-private algorithm can provide even a weak notion of utility, called meaningfulness, which was first considered in [8]. We then prove that any algorithm that has some utility must be meaningful. Meaningfulness is based on the notion of statistical difference:

Definition 3.1. The statistical difference between two distributions Pr_A and Pr_B over the domain X is SD(Pr_A, Pr_B) = Σ_{x ∈ X} |Pr_A(x) − Pr_B(x)|.

Note that SD(Pr_A, Pr_B) ∈ [0, 2]: it is 0 when Pr_A = Pr_B, and is 2 when ∀x. (Pr_A(x) = 0 ∨ Pr_B(x) = 0). We explain now informally the connection between statistical difference and utility. Suppose an algorithm A gives reasonable estimates to counting queries Q. Let Q be a "large" query, i.e. if executed on the entire domain it returns a sizeable fraction, say 1/5. When we estimate Q on the view published for a particular instance I we will get some errors, which depend on the actual size n = |I|. If there is any utility to A then a user should be able to distinguish between the two extreme cases, when Q(I) = n and when Q(I) = 0. To capture this intuition, define the uniform distribution Pr1(I) as follows: if |I| = n, then Pr1[I] = 1/(m choose n), and if |I| ≠ n, then Pr1[I] = 0. Thus, Pr1 makes every instance of size n equally likely. Let E_Q be the event (|I| = n ∧ Q(I) = n), and E'_Q the event (|I| = n ∧ Q(I) = 0). Then we expect to be able to differentiate between the following two distributions: Pr_A^Q(V) = Pr12[V|E_Q] and Pr_B^Q(V) = Pr12[V|E'_Q]. To obtain a reasonable estimate for Q, SD(Pr_A^Q, Pr_B^Q) should be large. On the other hand, if SD(Pr_A^Q, Pr_B^Q) is very small then no reasonable estimate of the query can be obtained from any of the published views. An algorithm is meaningless if SD(Pr_A^Q, Pr_B^Q) is small for a large fraction of the queries Q. An algorithm is meaningful if it is not meaningless.

Definition 3.2. Let f < 1 be a constant independent of the domain size m and database size n. Let Q be a counting query and Q(D) represent the number of tuples in the domain D which satisfy Q. Consider all queries Q s.t. (1/2)(1 − f) ≤ Q(D)/m ≤ (1/2)(1 + f). An algorithm A is called meaningless if SD(Pr_A^Q, Pr_B^Q) is smaller than 1/2 for a fraction 2/3 of queries Q.

Next we state our impossibility result using meaningfulness as the definition of utility. For the sake of concreteness we use the constant 1/2 for statistical difference and 2/3 for the fraction of queries in the definition of meaningfulness. However, the impossibility result works for arbitrary constants.

Theorem 3.3. Let A be a meaningful algorithm. Then for all γ < 1 there exists a constant c, independent of m, n and γ, such that A is not (d, γ)-private for any d ≥ (1/c)·(γ/(1−γ))·(n/√m). In particular no meaningful algorithm can offer (k·n/m, γ)-privacy for k = Ω(√m).

Intuitively, the theorem states that if we fix γ as the bound on the posterior then any algorithm providing meaningful utility cannot provide (d, γ)-privacy for d = Ω(n/√m). Absence of (d, γ)-privacy means that there is a d-independent adversary for which the following happens for some tuple t: there is a positive leakage (Pr1[t] ≤ d but Pr12[t|V] ≥ γ) or there is a negative leakage (Pr12[t|V] is much smaller than Pr1[t]).

To make Theorem 3.3 strong, we used meaningfulness, which is a very weak notion of utility. Our notion of (ρ, ε)-usefulness is much stronger, as shown by the proposition below. Thus, the result in Theorem 3.3 applies to (ρ, ε)-usefulness as well.

Proposition 3.4. Let A be any (ρ, ε)-useful algorithm. If ρ = o(√n) then for any query Q, SD(Pr_A^Q, Pr_B^Q) > 2(1 − 2ε). In other words, any useful algorithm is meaningful.

4. ALGORITHM

4.1 The αβ Algorithm

The αβ algorithm (Algorithm 1) takes as input a database instance I and publishes a view V. We assume that I has l attributes A1, A2, ..., Al. Each attribute Ai takes values from a finite domain Di. We denote D = D1 × ... × Dl the domain of all tuples with attributes A1,...,Al; thus, I ⊆ D, and we write m = |D|.

Algorithm 1: αβ (Inputs: I, α, β, D1, .., Dl)
1: Let D = D1 × D2 × ... × Dl
2: α-step: For every tuple in I, insert it in the view V independently with probability α + β.
3: β-step: For every tuple in D which is not in I, insert it in the view V independently with probability β. (Algorithm 2 shows how to do this efficiently.)
4: Publish V, α, β and D1, .., Dl

The large size of D makes a naive implementation of the αβ algorithm impossible. Specifically the β-step, which requires going through all tuples in the domain and inserting them with probability β, will have time complexity O(m). However, note that the expected number of tuples to be inserted is βm, which is much smaller. We shall see in Sec. 4.2 that βm ≤ (k/γ)n. Algorithm 2 is a simple randomized algorithm which runs in expected time O(βm). It simulates, exactly, the β-step of the αβ algorithm.

Algorithm 2: β-step (Inputs: β, D1, .., Dl)
1: Pick a random number r according to the binomial distribution with parameters m − n and β. The probability that r equals i is thus C(m−n, i)·β^i·(1 − β)^(m−n−i)
2: Insert r tuples into the view as follows:
3: repeat
4:   For all attributes Aj, pick a value aj ∈ Dj uniformly at random
5:   Let t = (a1, a2, . . . , al)
6:   if t not in V ∪ I then
7:     insert t in V
8:   end if
9: until r tuples have been inserted in V

Note that in step 4 of the β-step algorithm, we may pick a tuple which is already in the view. This would result in an unnecessary increase in the run time of the algorithm. However, the probability of a tuple being picked again is very low, as βm ≪ m. A nice property of the β-step algorithm is that if T denotes the actual run time of the algorithm and T0 the expected run time, then the probability that T is greater than T0 by an additive factor of c goes down exponentially with c.

Next we give the estimation algorithm. Given a view V, the goal is to obtain an estimate Q̃(V) which approximates the actual answer Q(I). Recall that α and β are published.

Algorithm 3: Estimation (Inputs: V, Q, α, β, D1, .., Dl)
1: Let D = D1 × ... × Dl.
2: Compute nV = Q(V); that is, nV is the result of the query evaluated on the view V.
3: Compute nD = Q(D); that is, nD is the query evaluated on the entire domain (we explain below how to do this efficiently).
4: Q̃(V) = (nV − β·nD) / α
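The following Python sketch is our own rendering of Algorithms 1-3, assuming a product domain of modest total size (the names `alpha_beta_publish` and `estimate` are ours): the α-step keeps each true tuple with probability α + β, the β-step draws the number of noise tuples from Binomial(m − n, β) and samples them attribute-wise, and the estimator rescales counts as (nV − β·nD)/α.

```python
import numpy as np
from itertools import product

def alpha_beta_publish(I, domains, alpha, beta, seed=0):
    """Algorithms 1 and 2: publish a view V from the instance I (a set of tuples)."""
    rng = np.random.default_rng(seed)
    m = int(np.prod([len(d) for d in domains]))   # domain size (kept modest in this sketch)
    # alpha-step: keep each true tuple independently with probability alpha + beta.
    V = {t for t in I if rng.random() < alpha + beta}
    # beta-step (Algorithm 2): one binomial draw, then attribute-wise sampling;
    # expected work is O(beta * m) rather than O(m).
    r = rng.binomial(m - len(I), beta)
    inserted = 0
    while inserted < r:
        t = tuple(d[rng.integers(len(d))] for d in domains)   # uniform random domain tuple
        if t not in V and t not in I:
            V.add(t)
            inserted += 1
    return V

def estimate(Q, V, domains, alpha, beta):
    """Algorithm 3: Q~(V) = (nV - beta * nD) / alpha, with Q a counting function."""
    nV = Q(V)
    nD = Q(product(*domains))   # brute force over D; Sec. 4.1 shows how to do this efficiently
    return (nV - beta * nD) / alpha
```

With the Table 1(a) instance and the parameters of Example 4.1 (α = 2/3, β = 2αn/m), `estimate` is an unbiased estimator of Q(I): E[nV] = α·Q(I) + β·Q(D), so subtracting β·nD and dividing by α recovers Q(I) in expectation. The single binomial draw is what keeps the β-step's cost proportional to the number of inserted noise tuples, matching the remark on Algorithm 2.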

Example 4.1. We illustrate the estimation algorithm using Table 1(a) as an example. Table 1(d) is a sample view published by the αβ algorithm. The domains for the attributes age, nationality and score are the sets [20, 39], {American, British, Indian} and [81, 100] respectively, so m = 20 × 3 × 20 = 1200 and n = 6. The parameters used are α = 2/3 and β = 2αn/m = 1/150. Suppose a researcher has the hypothesis "the test score of an individual is less than three times his age". For this purpose the following query Q needs to be evaluated:

    Q = select count(*)
        from I
        where score < 3*age

nD, the number of tuples in the entire domain which satisfy Q, is 190. nV = 6 is the query evaluated on the view. The estimate, Q̃(V), is (nV − β·nD)/α = (6 − 190/150)/0.66 ≈ 7.1, while the original answer is 4.

Computing Q(D). In the rest of this subsection we discuss how to compute Q on the domain D, line 3 of Algorithm 3. All prior methods (like privacy algorithms for OLAP) assume that the where condition in the query is a conjunction of simple predicates, i.e. each predicate involves one attribute. An example of such a query is (score < 90) and (age > 30). But the predicate score < 3*age in Example 4.1 is not of this form. We explain here how to compute Q(D) for queries with arbitrarily complex conditions C. Our discussion is informal, and we refer to our technical report [16] for the formal algorithm.

For any boolean condition C, denote p(C) the probability that a randomly chosen tuple t ∈ D satisfies C: p(C) = |{t | t ∈ D, C(t)}|/m (recall: m = |D|). It suffices to show how to compute p(C), because Q(D) = p(C)·m. We express C in DNF, thus C = C1 ∨ C2 ∨ ... ∨ Ck, and apply inclusion/exclusion to p(C) = Σ_i p(Ci) − Σ_{i,j} p(Ci ∧ Cj) + ... Thus, it suffices to show how to compute p(C) where C is a conjunction of atomic predicates. For that, we rely on three ideas.

First, if C is an atomic predicate on, say, two attributes Ai, Aj, then we evaluate p(C) by iterating over Di × Dj. In the example above, C is score < 3*age, and we can evaluate p(C) by brute force, by iterating over the domain of all (score, age) pairs. Denoting mk = |Dk|, for k = 1, ..., l, the cost of computing p(C) over attributes Ai, Aj is mi·mj: note that this is much smaller than m = m1·m2 ... ml.

Second, if C = C1 ∧ C2 and C1, C2 are conditions that refer to distinct sets of attributes, then p(C) = p(C1)·p(C2). This is obvious, since C1, C2 are independent.

Third, suppose C = C1 ∧ C2 and the predicates in C1 and C2 share some common attributes, say Ai and Aj. Then we consider all pairs of constants a ∈ Di and b ∈ Dj. We create mi·mj predicates C1[a/Ai, b/Aj] and similarly C2[a/Ai, b/Aj], and compute their probabilities. Here a/Ai means that we substitute all occurrences of the attribute Ai with the constant a, and similarly for b/Aj. Finally, C1[a/Ai, b/Aj] and C2[a/Ai, b/Aj] are independent, hence p(C) = Σ_{a ∈ Di, b ∈ Dj} p(C1[a/Ai, b/Aj])·p(C2[a/Ai, b/Aj]). The cost is a multiplicative factor mi·mj.

Thus, the algorithm's running time depends on the largest number of attributes shared between two expressions C1, C2. This depends on how we factor the conjunction C, and is related to the tree width [21].

Example 4.2. Consider the condition C = P1(A1, A2) ∧ P2(A2, A3) ∧ P3(A3, A4), where P1, P2, P3 are three complex predicates (e.g. P1 = A1² + A2² ≥ 20). We write C as follows:

    C = C1 ∧ P3(A3, A4)
    C1 = P1(A1, A2) ∧ P2(A2, A3)

We proceed top-down. To evaluate C we note that the two conjuncts share the common attribute A3, so we substitute A3 with every constant a3 ∈ D3, and compute the probabilities of P3(a3, A4), and of:

    C1[a3/A3] = P1(A1, A2) ∧ P2(A2, a3)

The probability of P3(a3, A4) is obtained by iterating over all a4 ∈ D4 and checking if the condition P3(a3, a4) is true. The probability of C1[a3/A3] is obtained recursively. Here the two conjuncts share A2, so we iterate over all a2 ∈ D2 and compute the probabilities of P1(A1, a2) and P2(a2, a3). Our discussion leads to the following formula, where p(P1(a1, a2)) is 1/m if (a1, a2) satisfies the predicate P1 and is 0 otherwise (and similarly for P2, P3):

    p(C) = Σ_{a3 ∈ D3} ( Σ_{a2 ∈ D2} Σ_{a1 ∈ D1} p(P1(a1, a2))·p(P2(a2, a3)) ) · ( Σ_{a4 ∈ D4} p(P3(a3, a4)) )

The time complexity is m3·(m1·m2 + m4).

4.2 Privacy and utility analysis

The following theorem shows that the αβ algorithm satisfies (d, γ)-privacy.

Theorem 4.3. The αβ algorithm is (d, γ)-private, where d ≤ γ, if we choose α and β such that β/(α + β) ≥ d(1−γ)/(γ(1−d)) and α + β ≤ 1 − d/γ.

As shown in our technical report [16], any (d, γ)-private algorithm satisfies differential privacy, the notion of privacy described in [7]. This means, in particular, that the αβ algorithm satisfies differential privacy as well.

Next we show that the αβ algorithm satisfies the definition of (ρ, ε)-usefulness.

Theorem 4.4. Let r be a constant and α, β be such that β ≤ r·α²·(n/m). Then, the αβ algorithm is (ρ, ε)-useful for ρ = √(2(r + 1)·ln(2/ε)).

Choice of parameters. We wish to ensure (kn/m, γ)-privacy, where k is a constant, i.e. d = kn/m is the bound on the adversary's prior probability, and γ is the bound on the posteriori probability that is acceptable for us. We also want to ensure utility. Choosing α + β = 1/2 satisfies both Theorems 4.3 (assuming d/γ < 1/2) and 4.4. Next, we choose β = (k/γ)·(n/m) = d/γ. This satisfies Theorem 4.3 (since 1 > ((1−γ)/(1−d))·(α + β)), and also satisfies Theorem 4.4 whenever k/γ ≤ r/4. In other words, with these choices of parameters we have (kn/m, γ)-privacy and (ρ, ε)-usefulness for ρ = 2·√(2·(k(1−γ)/γ)·ln(2/ε)). The privacy/utility tradeoff is like this: we want to protect against a powerful adversary, k ≫ 1, and want to guarantee a small a posteriori, γ ≪ 1; and this limits the bound on the estimate's error as ρ ≈ √(k/γ).
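To make Example 4.1 and the parameter discussion concrete, here is a small worked sketch (our own; it only plugs numbers into the formulas above, and the use of the Section 5 figures k = 10, γ = 0.2, n = 30162 is purely illustrative).

```python
import math

def choose_parameters(k, gamma, n, m):
    """Parameter choice of Sec. 4.2: alpha + beta = 1/2 and beta = (k/gamma)*(n/m)."""
    beta = (k / gamma) * (n / m)
    alpha = 0.5 - beta
    rho_approx = math.sqrt(k / gamma)      # tradeoff rho ~ sqrt(k / gamma)
    return alpha, beta, rho_approx

# Example 4.1: alpha = 2/3, beta = 1/150, nV = 6, nD = 190.
alpha, beta = 2 / 3, 1 / 150
nV, nD = 6, 190
print(round((nV - beta * nD) / alpha, 1))   # 7.1, as in Example 4.1

# Experiment-scale parameters (Sec. 5): k = 10, gamma = 0.2, n = 30162, m = 648023040.
a, b, rho = choose_parameters(10, 0.2, 30162, 648023040)
print(round(rho * math.sqrt(30162)))        # absolute error bound ~ rho * sqrt(n) ~ 1228
```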

Comparison with randomized response algorithms. The FRAPP method has the best tradeoff among the local perturbation algorithms. The tradeoff for that method can be expressed as ρ = k/γ. On the other hand the αβ algorithm has a much better tradeoff of ρ = √(k/γ). Thus for the same privacy (i.e. the same k and γ) the αβ algorithm yields smaller errors as compared to other methods. This becomes especially important since the error bound is actually ρ√n, which means that even a small change in ρ gets amplified, as √n is often large. We support our claim by providing empirical results as described in the case study.

4.3 Comparison to FRAPP

FRAPP [3] provides a general framework for representing randomized response algorithms. Many randomized operators like 'select a size' [9] and MASK [18] were described using this framework. We describe here their framework, then compare it to the αβ algorithm.

The database instance and the published view are denoted by column vectors x and y, respectively, of size m. The randomized response algorithm is represented as an m × m matrix A whose entry a_pq represents the probability that the tuple p of the database gives rise to the tuple q in the view. Given the matrix A and the input database x, the view y is generated by a localized random process. The random process, in some sense, can be succinctly represented as y = Ax.

[3] show that the utility of the algorithm increases when the condition number of the matrix A decreases: the condition number is the ratio between the largest and the smallest eigenvalue. On the other hand, the privacy constraint requires that the ratio of any two entries in the same row of the matrix remains bounded. Under this condition, [3] prove that the matrix that achieves the optimal privacy/utility tradeoff is as follows: the diagonal elements are γ_frapp, and all non-diagonal elements are (1 − γ_frapp)/(m − 1).

Our αβ algorithm strictly generalizes the FRAPP framework by taking advantage of the fact that it can use a central server. Formally, the transformation in our algorithm is described as y = Ax + b. Thus the random process effectively adds a noise vector b to Ax. Intuitively speaking, this provides privacy as the effective ratio of entries in a row remains bounded. Moreover, utility improves because of the lower condition number of A. For the αβ algorithm, the matrix A is the diagonal matrix with each diagonal entry equal to α. The noise vector b is the uniform vector with all entries equal to β.
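The matrix view of the two mechanisms can be written down directly for a toy domain. The sketch below is ours: it builds the FRAPP perturbation matrix (diagonal γ_frapp, off-diagonal (1 − γ_frapp)/(m − 1)) and the αβ representation A = α·Id, b = β·1, and compares the expected view counts for a small 0/1 instance vector x.

```python
import numpy as np

m = 6                        # toy domain size
gamma_frapp = 1 / 3
alpha, beta = 2 / 3, 1 / 150

# FRAPP: diagonal entries gamma_frapp, off-diagonal (1 - gamma_frapp) / (m - 1).
A_frapp = np.full((m, m), (1 - gamma_frapp) / (m - 1))
np.fill_diagonal(A_frapp, gamma_frapp)

# alpha-beta: A = alpha * I plus a uniform noise vector b = beta * 1.
A_ab = alpha * np.eye(m)
b_ab = beta * np.ones(m)

x = np.array([1, 1, 0, 0, 0, 0])   # toy instance: two of the six domain tuples are present

print(A_frapp @ x)                 # expected view counts under FRAPP, y = A x
print(A_ab @ x + b_ab)             # expected view counts under alpha-beta, y = A x + b

# The condition number of A drives utility (Sec. 4.3): alpha * I is perfectly
# conditioned, while the FRAPP matrix is not.
print(np.linalg.cond(A_frapp), np.linalg.cond(A_ab))
```

With x holding unit entries for the tuples of I, y gives the expected multiplicity of each domain tuple in the published view; the well-conditioned diagonal A is what lets the central server add less distortion for the same privacy level.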

4.4 Extensions

4.4.1 Updates

Suppose the data owner updates the instance I, and thus obtains several versions I1, I2, ... If the data owner runs the αβ algorithm separately on each version, she obtains a sequence of views V1, V2, ... This, however, creates a severe privacy breach. For example, consider the case when a single tuple t occurs in each of the published views. As β is significantly smaller than α, the probability that t was inserted in each of the views, when it does not belong to the original database, is very small. This results in an obvious privacy breach with respect to t. The problem is that the instances I1, I2, ... are highly correlated, i.e. two consecutive instances share a large number of tuples. To avoid this scenario, we describe Algorithm 4 below. It allows us to publish multiple views while providing the same privacy. For readability we consider only the extreme case, when I remains unchanged (this is the worst case), i.e. I1 = I2 = ... The algorithm needs a parameter v, which represents a bound on the number of views published.

Algorithm 4: Publishing multiple views
1: Let V1, V2, ..., V_{i−1} be the views published so far.
2: Compute V′ = ∪_{j=1}^{i−1} Vj.
3: Generate the view Vi as follows:
4: start
5:   For each tuple in the original instance I, insert it in Vi with probability α + β
6:   For each tuple in V′ − I, insert it in Vi with probability (α + β)·e^(−1/v)
7:   For every other tuple in the domain, insert it in Vi with probability [1 − (α + β)]^(i−1)·β
8: end start
9: Publish Vi

We discuss three aspects of publishing multiple views: we analyze the privacy, we examine the total size (note that we insert more tuples for each new version), and we examine the utility.

For privacy, note that once a noisy tuple has been inserted in some published view, the algorithm increases the probability of its insertion for future views. This makes it hard to determine whether a tuple occurs in the database even if it occurs in many of the published views, and allows us to prove in the technical report [16] that the algorithm is (d, γ)-private.

The size of the views increases, but it is bounded. To see that, note that the probability of inserting a new noisy tuple, i.e. a tuple outside V′, becomes exponentially smaller with each published view. It can be shown that the size of any view is not more than the size of V1 (the first published view) times 1/(α + β). This gives an upper bound on the size of all published views, which is independent of the number of views.

The utility of a view Vi in the sequence is slightly less than that of the first view. The estimator for a query Q on the view Vi is

    Q̃(Vi) = ( nVi − (α + β)·e^(−1/v)·nV′ − [1 − (α + β)]^(i−1)·β·nd ) / ( (α + β)·(1 − e^(−1/v)) )

Here nVi is the answer of the query Q evaluated over the view Vi; nV′ is the answer of the query Q evaluated over the union V′ of all the previously published views; nd is the answer evaluated over the entire domain. It can be shown that by using this estimator, the privacy/utility tradeoff of the view Vi can be expressed as ρ = √(v·k/γ).

4.4.2 Non-uniform Noise Vector

So far we considered arbitrary d-independent adversaries, i.e. tuple-independent adversaries with the only restriction that the tuple probability is < d. The prior for such adversaries can be very different from the underlying distribution from which the database was drawn. For example, consider a database with schema (age, nationality, disease). Flu is a more common disease than cancer. Thus, according to the source distribution, the probability of a person having flu is higher than her having cancer. An adversary, however, might know that the person X has a fatal disease. For that particular X, the prior would have a higher probability for cancer.

However, if we know the attacker's prior then we can improve the privacy/utility tradeoff by generating the random noise according to this prior. More precisely, we modify the noise vector b according to a source distribution, the distribution which models the prior. Adding noise according to the source distribution allows us to add the smallest amount of noise, while providing the same privacy. Adding a smaller amount of noise enables better utility. We modify the αβ algorithm as follows. Instead of inserting tuples with uniform probability β, we do so with probability β(t), a function proportional to source(t). We consider the following source distribution: each value a in the domain of attribute Ai has an associated source probability, p(a); the source distribution is attribute-independent, i.e. source(t) = p(t.A1) ··· p(t.Al). This demands changes in both the β-step of the αβ algorithm as well as the estimation algorithm. The β-step can be approximated as:

Algorithm 5: Generalized β-step
1: Compute the binomial distribution corresponding to parameters m − n and (Σ_{t ∈ (D−I)} source(t)) / (m − n)
2: Generate a random number r using this binomial distribution
3: Insert r tuples into the view using the following procedure:
4: repeat
5:   For all attributes Aj, pick a value aj ∈ Dj randomly according to the distribution p(aj)
6:   Let t = (a1, a2, . . . , al)
7:   if t not in I then
8:     V = V ∪ t
9:   end if
10: until r tuples have been inserted in V

The estimate for query Q also changes. For a fixed β, it was simply (nV − β·nD)/α, where nD denotes the number of tuples in the domain which satisfy the query, i.e. nD = Q(D). Now, instead of β·nD, we need to evaluate Σ β(t) over the tuples t of the domain that satisfy Q.
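A sketch of the generalized β-step is shown below (our own, under the paper's attribute-independent source model source(t) = p(t.A1)···p(t.Al); the helper name `generalized_beta_step` is ours, and `expected_noise` stands in for the binomial mean implied by the β(t) rates, whose exact normalization we leave open). The key change from Algorithm 2 is that attribute values are drawn from the per-attribute distributions p(a) rather than uniformly.

```python
import numpy as np

def generalized_beta_step(I, domains, attr_probs, m, expected_noise, seed=0):
    """Sample noise tuples attribute-wise from per-attribute source distributions.

    domains[j] lists the values of attribute A_j and attr_probs[j] their source
    probabilities p(a) (each summing to 1).  expected_noise is the desired mean
    number of noise tuples (assumption: it plays the role of beta*m in Algorithm 2).
    """
    rng = np.random.default_rng(seed)
    r = rng.binomial(m - len(I), expected_noise / (m - len(I)))   # one binomial draw
    V = set()
    while len(V) < r:
        # Step 5 of Algorithm 5: pick each attribute value according to p(a).
        t = tuple(d[rng.choice(len(d), p=p)] for d, p in zip(domains, attr_probs))
        if t not in I:                  # step 7: skip tuples already in the instance
            V.add(t)
    return V
```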

5. EXPERIMENTS

To compare the privacy/utility tradeoff of the αβ algorithm and the FRAPP method, we experimented with US census data in the Adults database. The database is widely used in experiments for illustration of privacy preserving algorithms [13, 4]. The database I has 9 attributes and n = 30162 tuples. The product of the attributes' active domains has m = 648023040 tuples, thus n/m ≈ 4.65 × 10⁻⁵. For both methods, the parameters were chosen so as to ensure (10n/m, 0.2)-privacy: i.e. the adversary's prior is bounded by d = 0.0465% and the posterior by γ = 20%. Privacy for both methods was, thus, fixed to the same value.

We compared the utility of both methods by running all possible selection queries with up to three attributes. That is, for every subset of (up to) three attributes Ai, Aj, Ak, i, j, k ∈ {1, ..., 9}, and every three values a ∈ Di, b ∈ Dj, c ∈ Dk, we evaluated the query:

    select count(*)
    from I
    where Ai = 'a' and Aj = 'b' and Ak = 'c'

Figure 2 (panels 2(a): FRAPP algorithm [3]; 2(b): αβ algorithm): Graph of Q̃(V) vs Q(I) for the FRAPP and αβ algorithms. Each query corresponds to a point. Points near the line y = x have Q̃(V) ∼ Q(I), which means that the query gets estimated well. Points lie on a band, with the width of the band signifying the extent of the errors.

Figure 2 shows the result of the comparison. Each dot represents one query Q, where x = Q(I) and y = Q̃(V): thus, a perfect estimator would correspond to the y = x line. It is evident that the αβ algorithm provides much better estimates and hence has better utility. Figure 2 points to another interesting property: for small values of Q(I) the estimated values are far off (even negative); this is necessary to preserve privacy. But for large values of Q(I) the estimations are much better and meaningful.

As described earlier, the tradeoff for the αβ algorithm is ρ1 ≈ √(k/γ). Actually, the exact formula is ρ1 = 2·√(2·(k(1−γ)/γ)·ln(2/ε)), while for FRAPP it is ρ2 = 2·(k(1−γ)/γ)·√(ln(2/ε)). Thus ρ1/ρ2 = √(2γ/(k(1−γ))), which for our choice of k and γ is 1/√20 ∼ 1/4.5. However, ρ1 and ρ2 are theoretical upper bounds on the error.

Figure 3: Graph of average cumulative error vs Q(I) for the FRAPP and αβ algorithms: the upper curve corresponds to the FRAPP algorithm, showing around 4.5 times larger errors on average than the αβ algorithm.

Figure 3 compares the observed values of ρ1 and ρ2. It is a graph of cumulative average error vs. Q(I). There are two curves. Each point (x, y) on the curves has the following connotation: y is the average error |Q̃(V) − Q(I)| taken over all queries with Q(I) ≥ x. The upper curve represents the cumulative average errors for the FRAPP method while the lower curve corresponds to the αβ algorithm. From the graph one can compute the approximate ratio of the observed values ρ1_obs/ρ2_obs. As evident from the graph it is ∼ 120/520 = 1/4.3, very close to the theoretically predicted value.

6. TUPLE CORRELATIONS

So far our analysis was restricted to tuple-independent adversaries. This makes the impossibility result very strong, but restricts the applicability of the αβ algorithm. Here we discuss several extensions to tuple correlations. First, we show that if the adversary knows arbitrary correlations, then no algorithm can achieve both privacy and utility. Then we examine a restricted class of correlations called exclusions, which is sufficiently general to model join/link attacks. For explaining the behavior of our algorithm with respect to these adversaries we need to distinguish between positive and negative leakage. Positive leakage happens when the adversary is able to determine with high probability that a specific tuple exists in the database. On the other hand, negative leakage occurs when it is possible to determine with high confidence that a specific tuple does not exist in the database. As described earlier, (d, γ)-privacy ensures absence of positive as well as negative leakage against all d-independent adversaries.

For adversaries with correlations restricted to exclusions, we show that our algorithm still guarantees privacy against positive leakages (utility, of course, is unchanged), but it cannot protect against negative leakage. The policy of protecting against positive leakage while permitting negative leakage is sometimes acceptable in practice, and it raises the question whether our algorithm could be strengthened in this case: after all, the impossibility result in Theorem 3.3 is based on a privacy definition for protecting against both negative and positive leakages. We answer this negatively, by giving a variant of the impossibility result for positive leakages only.

6.1 Arbitrary Correlations

Suppose we publish a view V of a database of diseases with schema (age, zip, disease) using our algorithm. An attacker knows the age and zip code for both Joe and Jim. Now Joe and Jim are brothers: if one has diabetes, then the other is quite likely to have diabetes as well (at least that's what the attacker believes). Suppose the attacker finds two tuples in V matching both Joe and Jim having diabetes. The probability that none was in I and both were inserted by the algorithm is very small: β². In contrast, because of their strong correlation, the probability that they were both in the instance I is now much larger: Pr1[t ∩ t′] ≫ Pr1[t]·Pr1[t′]. Thus, upon seeing both tuples in the view, the attacker concludes that they occurred in the original database with high probability. This is the reason why our algorithm has difficulties hiding data when the prior has correlations.

To the best of our knowledge, [14] is the only approach in the literature that handles such correlations. However, the approach is designed for the case when there is a single sensitive attribute. It is not clear what the privacy/utility tradeoff for the method would be in the presence of multiple sensitive attributes, or when the distinction between sensitive and non-sensitive attributes is not made. Perhaps the difficulty in dealing with such correlations is unavoidable. The following theorem shows that if the adversary is allowed arbitrary tuple correlations then any algorithm providing meaningful utility will have a positive leakage on some tuple. Theorem 6.1 is proved using a reduction from the negative result in [7]. The proof appears in our technical report [16].

Theorem 6.1. Let A be any meaningful algorithm. Then for every d = Ω(n/m), there exists a d-bounded adversary for which Pr1[t] ≤ d but Pr12[t|V] ≥ 3/4 for some tuple t.

6.2 Exclusions

We consider now some restricted forms of correlations: exactly one tuple from a set of tuples occurs in the database.

Definition 6.2. A d-exclusive adversary is a d-bounded adversary with the following kind of correlations among the tuples: there is a partition of D into a family of disjoint sets, D = ∪_j Sj, s.t.

540 • Tuples are pairwise independent if they do not belong obeys the following condition: If two tuples ti and tj to the same set. lie in different buckets then they are independent con- ditioned on the view. • Exactly one tuple in each set occurs in the database instance • Let the number of buckets be NB . We assume that the distribution of tuples both of the database instance This type of correlations models adversaries performing and the domain among the buckets is not too skewed. join/link attacks. Such attackers are able to determine some More formally , for every k > 1 there exists a Nk < NB of the identifying attributes of a particular tuple, say the age such that if we remove any Nk buckets, the remaining and nationality of a person that they know must occur in the still contain a fraction 1/k of tuples of I as well as D. database. The adversary can thus identify a set of tuples Sj of the domain such that exactly one tuple in the set belongs For example consider the method of full domain general- to the database. ization (in [12]) as applied to ensure k-anonymity or the anatomy method (in [20]) as applied to ensure l-diversity. 6.2.1 Positive results for d-exclusive adversaries For both the methods the tuples corresponding to the dif- We have already seen the privacy analysis for the αβ al- ferent anonymized groups form buckets which satisfy the gorithm for (d, γ)-privacy, which is a privacy definition for assumptions above. protection against d-independent adversaries. In this sec- The impossibility result for d-independent adversaries as tion, we show that the algorithm provides a slightly weaker well as the result for d-bounded adversaries had used mean- form of guarantee for d-exclusive adversaries. We show ingfulness as the utility definition. Impossibility result for that the algorithm ensures that there is no positive leak- d-exclusive adversaries requires a slight modification to this age for d-exclusive adversaries: That is if Pr1[t] ≤ d then definition and is based on the notion of k-meaningfulness. Pr12[t|V ] ≤ γ. Intuitively, k-meaningfulness works on the basis that a view,

1 d 1−γ which publishes a small fraction of the database while not Theorem 6.3. If α + β = 2 and β ≥ 2 γ ( 1−d ) then the providing any information for other tuples, does not have algorithm ensures absence of positive leakage for all tuples good utility. More formally, a view V is k-meaningless if and for all d-exclusive adversaries, i.e if P r1[t] ≤ d then there is a small set S, where S ⊂ I, such that V does not Pr12[t|V ] ≤ γ provide any information for tuples outside S. We denote |S| as n(1− 1 ) where 1/k is a large fraction close to 1. The con- The algorithm cannot guarantee absence of negative leak- k age for d-exclusive adversaries, as seen from the following straint that “V does not provide any information for tuples example: outside S” is represented as follows: Given a query Q, it is not possible to distinguish whether the answer to the query n Example 6.4 Suppose we publish a view V of a database is |Q∩S| or |Q∩S|+ k , based on V . Thus, if EQ is the event 0 of diseases with schema (age, zip, disease) using our algo- that Q(I) = |Q ∩ S| and EQ is the event Q(I) = |Q ∩ S|, rithm. An attacker knows the age and zip code of a per- then the probability that V is published when EQ occurs is 0 son Joe and additionally knows that no other person in the approximately same as the probability when EQ occurs. database has the same combination of age and zip code. The Definition 6.5. Let S be a subset of I with |S| ≤ n(1 − 1 ). prior belief of the attacker is that Joe has exactly one dis- k Let E be the event that Q(I) = |Q ∩ S| and E0 be the ease but any one of the diseases is equally likely. Suppose Q Q event Q(I) = |Q ∩ S| + n . Consider all queries Q s.t 1 (1 − the attacker finds a tuple t1 in V matching Joe and Dia- k 2 Q(D) 1 betes. Let us consider the tuple t2 corresponding to Joe and f) ≤ m ≤ 2 (1 + f). A view V is called k-meaningless Malaria. Theorem 6.3 shows that there won’t be a positive if for a fraction 2/3 of queries Q, 1 ≤ P r12[V |EQ] ≤ 2 and 2 P r12[V ] leakage for t1. However, for t2 there is a drastic drop in the P r [V |E0 ] 1 ≤ 12 Q ≤ 2 a posteriori probability causing a negative leakage. 2 P r12[V ]

6.2.2 Impossibility results for d-exclusive adversaries

In this section we extend the impossibility result to d-exclusive adversaries, accommodating the case when negative leakage is allowed and only positive leakage is considered a privacy breach. The impossibility result shows that if an algorithm satisfies some form of weak utility, then there exists a d-exclusive adversary, for d = Ω(n/√m), for which there is positive leakage on some tuple.

The impossibility result holds only for a restricted class of randomized algorithms. However, the class is broad enough to encompass many of the privacy-preserving algorithms in the literature. The algorithms which we consider satisfy the following bucketization assumptions:

• The algorithm is such that if the prior distribution is tuple independent then the posterior distribution is also almost tuple independent. More formally, the tuples in the domain can be partitioned into buckets such that the distribution Pr12[ti|V] over the tuples obeys the following condition: if two tuples ti and tj lie in different buckets, then they are independent conditioned on the view.

• Let the number of buckets be NB. We assume that the distribution of the tuples of both the database instance and the domain among the buckets is not too skewed. More formally, for every k > 1 there exists an Nk < NB such that if we remove any Nk buckets, the remaining buckets still contain a fraction 1/k of the tuples of I as well as of D. (A sketch of this check appears after Theorem 6.6 below.)

For example, consider the method of full-domain generalization (in [12]) as applied to ensure k-anonymity, or the anatomy method (in [20]) as applied to ensure l-diversity. For both methods the tuples corresponding to the different anonymized groups form buckets which satisfy the assumptions above.

The impossibility result for d-independent adversaries, as well as the result for d-bounded adversaries, used meaningfulness as the utility definition. The impossibility result for d-exclusive adversaries requires a slight modification of this definition, based on the notion of k-meaningfulness. Intuitively, k-meaningfulness captures the idea that a view which publishes a small fraction of the database, while providing no information about the remaining tuples, does not have good utility. More formally, a view V is k-meaningless if there is a small set S, where S ⊂ I and |S| is n(1 − 1/k) with 1/k a large fraction close to 1, such that V does not provide any information for tuples outside S. The constraint that "V does not provide any information for tuples outside S" is represented as follows: given a query Q, it is not possible to distinguish, based on V, whether the answer to the query is |Q ∩ S| or |Q ∩ S| + n/k. Thus, if EQ is the event that Q(I) = |Q ∩ S| and E0Q is the event that Q(I) = |Q ∩ S| + n/k, then the probability that V is published when EQ occurs is approximately the same as the probability when E0Q occurs.

Definition 6.5. Let S be a subset of I with |S| ≤ n(1 − 1/k). Let EQ be the event that Q(I) = |Q ∩ S| and E0Q be the event that Q(I) = |Q ∩ S| + n/k. Consider all queries Q s.t. (1/2)(1 − f) ≤ Q(D)/m ≤ (1/2)(1 + f). A view V is called k-meaningless if, for a fraction 2/3 of such queries Q, 1/2 ≤ Pr12[V|EQ]/Pr12[V] ≤ 2 and 1/2 ≤ Pr12[V|E0Q]/Pr12[V] ≤ 2.

A view is called k-meaningful if it is not k-meaningless. The following theorem shows that if the adversary is d-exclusive, then any algorithm which satisfies the bucketization assumptions and publishes k-meaningful views will have positive leakage for some tuple.

Theorem 6.6. Let A be any algorithm which satisfies the bucketization assumptions with Nk ≥ 3/d. Let V be any k-meaningful view published by A. Then there exists a constant c, independent of n and m, such that there is a d-exclusive adversary with d = max(3/Nk, c · n/√m) having positive leakage on some tuple.
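The second bucketization assumption is easy to check on a concrete bucketization. Below is a small, hypothetical sketch (the function names are ours, not the paper's) that computes, for a given k, the largest Nk such that removing any Nk buckets still leaves at least a 1/k fraction of the tuples of I and of D. The worst case removal for each count is the Nk buckets with the largest counts, so it suffices to examine buckets sorted in decreasing order.

```python
def largest_Nk(buckets_I, buckets_D, k):
    """buckets_I[b] / buckets_D[b]: number of tuples of I / of D in bucket b.
    Returns the largest N such that removing ANY N buckets still leaves at
    least a 1/k fraction of the tuples of I and of D."""
    def worst_case_limit(counts, total):
        removed, N = 0, 0
        # Removing the buckets with the largest counts is the worst case.
        for c in sorted(counts, reverse=True):
            if total - (removed + c) < total / k:
                break
            removed += c
            N += 1
        return N
    n = sum(buckets_I.values())
    m = sum(buckets_D.values())
    return min(worst_case_limit(buckets_I.values(), n),
               worst_case_limit(buckets_D.values(), m))

# Toy example: five buckets, e.g. produced by a generalization hierarchy.
buckets_I = {"b1": 3, "b2": 2, "b3": 2, "b4": 2, "b5": 1}        # n = 10
buckets_D = {"b1": 40, "b2": 30, "b3": 20, "b4": 20, "b5": 10}   # m = 120
print(largest_Nk(buckets_I, buckets_D, k=2))   # prints 1 for this toy instance
```

Theorem 6.6 then asks that this Nk be at least 3/d for the values of k of interest.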

7. CONCLUSIONS

We have described a formal framework for studying both the privacy and the utility of an anonymization algorithm. We proved an almost tight bound between privacy and utility, based on the attacker's power. For the case where privacy/utility can be guaranteed, we have described a new, quite simple anonymization algorithm, based on random insertions and deletions of tuples in/from the database. We have done a limited empirical study and saw a good privacy/utility tradeoff. An interesting problem for future work lies in bridging the gap between the impossibility result and the positive algorithm. In terms of d, the bound on the adversary's prior, the gap can be expressed as between O(n/m) and Ω(n/√m).

8. REFERENCES

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. ACM SIGMOD Record, 29, 2000.
[2] R. Agrawal, R. Srikant, and D. Thomas. Privacy preserving OLAP. In F. Özcan, editor, Proceedings of SIGMOD, pages 251–262. ACM, 2005.
[3] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. Jan. 01 2005.
[4] E. K. C. Blake and C. J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS: 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2005.
[6] I. Dinur and K. Nissim. Revealing information while preserving privacy. In PODS: 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2003.
[7] C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Proceedings of ICALP (Part 2), volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer, 2006.
[8] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In S. Halevi and T. Rabin, editors, Proceedings of Theory of Cryptography, volume 3876 of Lecture Notes in Computer Science, pages 265–284. Springer, 2006.
[9] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS: 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2003.
[10] A. V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. Inf. Syst., 29(4):343–364, 2004.
[11] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In S. Chaudhuri, V. Hristidis, and N. Polyzotis, editors, Proceedings of SIGMOD, pages 217–228. ACM, 2006.
[12] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In F. Özcan, editor, Proceedings of SIGMOD, pages 49–60. ACM, 2005.
[13] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In L. Liu, A. Reuter, K.-Y. Whang, and J. Zhang, editors, Proceedings of ICDE, page 24. IEEE Computer Society, 2006.
[14] D. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Halpern. Worst-case background knowledge for privacy-preserving data publishing. In ICDE, 2007.
[15] N. Mishra and M. Sandler. Privacy via pseudorandom sketches. In S. Vansummeren, editor, PODS, pages 143–152. ACM, 2006.
[16] V. Rastogi, D. Suciu, and S. Hong. The boundary between privacy and utility in data publishing, 2007. Available from www.cs.washington.edu/homes/vibhor/privacy.pdf.
[17] S. P. Reiss. Practical data-swapping: The first steps. ACM Transactions on Database Systems, 9(1):20–37, Mar. 1984.
[18] S. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In VLDB, pages 682–693. Morgan Kaufmann, 2002.
[19] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[20] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, editors, Proceedings of VLDB, pages 139–150. ACM, 2006.
[21] Y. Zabiyaka and A. Darwiche. Functional treewidth: Bounding complexity in the presence of functional dependencies. In A. Biere and C. P. Gomes, editors, SAT, volume 4121 of Lecture Notes in Computer Science, pages 116–129. Springer, 2006.
