
Blowfish Privacy: Tuning Privacy-Utility Trade-offs using Policies

Xi He (Duke University, Durham, NC, USA) · Ashwin Machanavajjhala (Duke University, Durham, NC, USA) · Bolin Ding (Microsoft Research, Redmond, WA, USA)
[email protected] · [email protected] · [email protected]

ABSTRACT

Privacy definitions provide ways for trading off the privacy of individuals in a statistical database for the utility of downstream analysis of the data. In this paper, we present Blowfish, a class of privacy definitions inspired by the Pufferfish framework, that provides a rich interface for this trade-off. In particular, we allow data publishers to extend differential privacy using a policy, which specifies (a) secrets, or information that must be kept secret, and (b) constraints that may be known about the data. While the secret specification allows increased utility by lessening protection for certain individual properties, the constraint specification provides added protection against an adversary who knows correlations in the data (arising from constraints). We formalize policies and present novel algorithms that can handle general specifications of sensitive information and certain count constraints. We show that there are reasonable policies under which our privacy mechanisms for k-means clustering, histograms and range queries introduce significantly less noise than their differentially private counterparts. We quantify the privacy-utility trade-offs for various policies analytically and empirically on real datasets.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Statistical Databases; K.4.1 [Computers and Society]: Privacy

Keywords

privacy, differential privacy, Blowfish privacy

1. INTRODUCTION

With the increasing popularity of "big-data" applications which collect, analyze and disseminate individual level information in literally every aspect of our life, ensuring that these applications do not breach the privacy of individuals is an important problem. The last decade has seen the development of a number of privacy definitions and mechanisms that trade off the privacy of individuals in these databases for the utility (or accuracy) of data analysis (see [4] for a survey). Differential privacy [6] has emerged as a gold standard not only because it is not susceptible to attacks that other definitions can't tolerate, but also since it provides a simple knob, namely ε, for trading off privacy for utility.

While ε is intuitive, it does not sufficiently capture the diversity in the privacy-utility trade-off space. For instance, recent work has shown two seemingly contradictory results. In certain applications (e.g., social recommendations [17]) differential privacy is too strong and does not permit sufficient utility. Next, when data are correlated (e.g., when constraints are known publicly about the data, or in social network data) differentially private mechanisms may not limit the ability of an attacker to learn sensitive information [12]. Subsequently, Kifer and Machanavajjhala [13] proposed a semantic privacy framework, called Pufferfish, which helps clarify assumptions underlying privacy definitions – specifically, the information that is being kept secret, and the adversary's background knowledge. They showed that differential privacy is equivalent to a specific instantiation of the Pufferfish framework, where (a) every property about an individual's record in the data is kept secret, and (b) the adversary assumes that every individual is independent of the rest of the individuals in the data (no correlations). We believe that these shortcomings severely limit the applicability of differential privacy to real-world scenarios that either require high utility or involve correlated data.

Inspired by Pufferfish, we seek to better explore the trade-off between privacy and utility by providing a richer set of "tuning knobs". We explore a class of definitions called Blowfish privacy. In addition to ε, which controls the amount of information disclosed, Blowfish definitions take as input a privacy policy that specifies two more parameters – which information must be kept secret about individuals, and what constraints may be known publicly about the data. By extending differential privacy using these policies, we can hope to develop mechanisms that permit more utility since not all properties of an individual need to be kept secret. Moreover, we can also limit adversarial attacks that leverage correlations due to publicly known constraints.

We make the following contributions in this paper:

• We introduce and formalize sensitive information specifications, constraints, policies and Blowfish privacy. We consider a number of realistic examples of sensitive information specification, and focus on count constraints.

• We show how to adapt well-known differential privacy mechanisms to satisfy Blowfish privacy, and using the example of k-means clustering illustrate the gains in accuracy for Blowfish policies having weaker sensitive information specifications.

• We propose the ordered mechanism, a novel strategy for releasing cumulative histograms and answering range queries. We show analytically and using experiments on real data that, for reasonable sensitive information specifications, the ordered hierarchical mechanism is more accurate than the best known differentially private mechanisms for these workloads.

• We study how to calibrate noise for policies expressing count constraints, and its applications in several practical scenarios.

Organization: Section 2 introduces the notation. Section 3 formalizes privacy policies. We define Blowfish privacy, and discuss composition properties and its relationship to prior work in Section 4. We define the policy specific global sensitivity of queries in Section 5. We describe mechanisms for k-means clustering (Section 6), and releasing cumulative histograms & answering range queries (Section 7) under Blowfish policies without constraints, and empirically evaluate the resulting privacy-utility trade-offs on real datasets. We show how to release histograms in the presence of count constraints in Section 8 and then conclude in Section 9.
2. NOTATION

We consider a dataset D consisting of n tuples. Each tuple t is considered to be drawn from a domain T = A1 × A2 × ... × Am constructed from the cross product of m categorical attributes. We assume that each tuple t corresponds to the data collected from a unique individual with identifier t.id. We will use the notation x ∈ T to denote a value in the domain, and x.Ai to denote the i-th attribute value in x. Throughout this paper, we make the assumption that the set of individuals in the dataset D is known in advance to the adversary and does not change; hence we use the indistinguishability notion of differential privacy [7]. We denote the set of possible databases by In, the set of databases with |D| = n. (In Sec. 3 we briefly discuss how to generalize our results to other differential privacy notions by relaxing this assumption.)

Definition 2.1 (Differential Privacy [6]). Two datasets D1 and D2 are neighbors, denoted by (D1, D2) ∈ N, if they differ in the value of one tuple. A randomized mechanism M satisfies ε-differential privacy if for every set of outputs S ⊆ range(M), and every pair of neighboring datasets (D1, D2) ∈ N,

    Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S].    (1)

Many techniques that satisfy differential privacy use the following notion of global sensitivity:

Definition 2.2 (Global Sensitivity). The global sensitivity of a function f : In → R^d, denoted by S(f), is defined as the largest L1 difference ||f(D1) − f(D2)||_1, where D1 and D2 are databases that differ in one tuple. More formally,

    S(f) = max_{(D1,D2) ∈ N} ||f(D1) − f(D2)||_1.    (2)

A popular technique that satisfies ε-differential privacy is the Laplace mechanism [7], defined as follows:

Definition 2.3 (Laplace Mechanism). The Laplace mechanism, M_Lap, privately computes a function f : In → R^d by computing f(D) + η, where η ∈ R^d is a vector of independent random variables, each ηi drawn from the Laplace distribution with parameter S(f)/ε; that is, Pr[ηi = z] ∝ e^{−|z|·ε/S(f)}.

Given some partitioning of the domain P = (P1, ..., Pk), we denote by hP : In → Z^k the histogram query. hP(D) outputs, for each Pi, the number of times values in Pi appear in D. hT(·) (or h(·) in short) is the complete histogram query that reports, for each x ∈ T, the number of times it appears in D. It is easy to see that S(hP) = 2 for all histogram queries, and the Laplace mechanism adds noise proportional to Lap(2/ε) to each component of the histogram.

We will use mean squared error as a measure of accuracy/error.

Definition 2.4. Let M be a randomized algorithm that privately computes a function f : In → R^d. The expected mean squared error of M is given by

    E_M(D) = Σ_i E[(f_i(D) − f̃_i(D))^2],    (3)

where f_i(·) and f̃_i(·) denote the i-th components of the true and noisy answers, respectively.

Under this definition, the accuracy of the Laplace mechanism for histograms is |T| · E[Laplace(2/ε)^2] = 8|T|/ε².
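To make Definitions 2.3 and 2.4 concrete, the following is a minimal sketch of the Laplace mechanism applied to the complete histogram query. It is our illustration in Python with numpy, not code from the paper, and the function and variable names are ours.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Add independent Laplace(sensitivity/epsilon) noise to each component (Def 2.3)."""
    return true_answer + rng.laplace(scale=sensitivity / epsilon,
                                     size=np.shape(true_answer))

rng = np.random.default_rng(0)
domain_size, epsilon = 100, 0.5
data = rng.integers(0, domain_size, size=1000)
hist = np.bincount(data, minlength=domain_size)   # complete histogram h(D)

# S(h) = 2: changing one tuple moves one unit between two histogram cells.
noisy_hist = laplace_mechanism(hist, sensitivity=2, epsilon=epsilon, rng=rng)

# Expected mean squared error (Def 2.4): |T| * Var(Lap(2/eps)) = 8|T|/eps^2.
print(((noisy_hist - hist) ** 2).sum(), 8 * domain_size / epsilon ** 2)
```
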
3. POLICY DRIVEN PRIVACY

In this section, we describe an abstraction called a policy that helps specify which information has to be kept secret and what background knowledge an attacker may possess about the correlations in the data. We will use this policy specification as input in our privacy definition, called Blowfish, described in Section 4.

3.1 Sensitive Information

As indicated by the name, Blowfish privacy is inspired by the Pufferfish privacy framework [13]. (Pufferfish and Blowfish are common names of the same family of marine fish, Tetraodontidae.) In fact, we will show later (in Section 4.2) that Blowfish privacy is equivalent to specific instantiations of semantic definitions arising from the Pufferfish framework.

Like Pufferfish, Blowfish privacy also uses the notions of secrets and discriminative pairs of secrets. We define a secret to be an arbitrary propositional statement over the values in the dataset. For instance, the secret s : t.id = 'Bob' ∧ t.Disease = 'Cancer' is true in a dataset where Bob has Cancer. We denote by S a set of secrets that the data publisher would like to protect. As we will see in this section, each individual may have multiple secrets. Secrets may also pertain to sets of individuals. For instance, the secret s : t1.id = 'Alice' ∧ t2.id = 'Bob' ∧ t1.Disease = t2.Disease is true when Alice and Bob have the same disease. However, in this paper, we focus on the case where each secret is about a single individual.

We call a pair of secrets (s, s′) ∈ S × S discriminative if they are mutually exclusive. Each discriminative pair describes properties that an adversary must not be able to distinguish between. One input to a policy is a set of discriminative pairs of secrets, S_pairs.

We now present a few examples of sensitive information specified as a set of discriminative secrets.

• Full Domain: Let s^i_x be the secret (t.id = i ∧ t = x), for some x ∈ T. We define S^full_pairs as:

    S^full_pairs = {(s^i_x, s^i_y) | ∀i, ∀(x, y) ∈ T × T}.    (4)

This means that for every individual, an adversary should not be able to distinguish whether that individual's value is x or y, for any x, y ∈ T.

• Attributes: Let x ∈ T denote a multidimensional value. Let x[A] denote the value of attribute A in x, and x[Ā] the values of the other attributes. Then a second example of sensitive information is:

    S^attr_pairs = {(s^i_x, s^i_y) | ∀i, ∃A, x[A] ≠ y[A] ∧ x[Ā] = y[Ā]}.    (5)

S^attr_pairs ensures that an adversary should not be able to sufficiently distinguish between any two values for each attribute of every individual's value.

• Partitioned: Let P = {P1, ..., Pp} be a partition that divides the domain into p disjoint sets (∪_i Pi = T and ∀1 ≤ i, j ≤ p, Pi ∩ Pj = ∅). We define partitioned sensitive information as:

    S^P_pairs = {(s^i_x, s^i_y) | ∀i, ∃j, (x, y) ∈ Pj × Pj}.    (6)

In this case, an adversary is allowed to deduce whether an individual is in one of two different partitions, but can't distinguish between two values within a single partition. This is a natural specification for location data – an individual may be OK with releasing his/her location at a coarse granularity (e.g., a coarse grid), but the location within each grid cell must be hidden from the adversary.

• Distance Threshold: In many situations there is an inherent distance metric d associated with the points in the domain (e.g., L1 distance on age or salary, or Manhattan distance on locations). Rather than requiring that an adversary not be able to distinguish between any pair of points x and y, one could require that each pair of points that are close are not distinguishable. For this purpose, the set of discriminative secrets is:

    S^{d,θ}_pairs = {(s^i_x, s^i_y) | ∀i, d(x, y) ≤ θ}.    (7)

Under this policy, the adversary will not be able to distinguish any pair of values with certainty. However, the adversary may distinguish points that are farther apart better than points that are close.

All of the above specifications of sensitive information can be generalized using the discriminative secret graph, defined below. Consider a graph G = (V, E), where V = T and the set of edges E ⊆ T × T. The set of edges can be interpreted as values in the domain that an adversary must not distinguish between; i.e., the set of discriminative secrets is S^G_pairs = {(s^i_x, s^i_y) | ∀i, ∀(x, y) ∈ E}. The above examples correspond to the following graphs: G^full corresponds to a complete graph on all the elements in T. G^attr corresponds to a graph where two values are connected by an edge when only one attribute value changes. G^P has |P| connected components, where each component is a complete graph on the vertices in Pi. Finally, in G^{d,θ}, (x, y) ∈ E iff d(x, y) ≤ θ.

We would like to note that a policy could have secrets and discriminative pairs about sets of individuals. However, throughout this paper, we only consider secrets pertaining to a single individual, and thus discriminative pairs refer to two secrets about the same individual. Additionally, the set of discriminative pairs is the same for all individuals. One can envision different individuals having different sets of discriminative pairs; for instance, we can model an individual who is privacy agnostic and does not mind disclosing his/her value exactly by having no discriminative pair involving that individual. Finally, note that in all of the discussion in this section, the specification of what is sensitive information does not depend on the original database D. One could specify sensitive information that depends on D, but one must be wary that this might leak additional information to an adversary. In this paper, we focus on data-independent discriminative pairs, uniform secrets, and secrets that only pertain to single individuals.

Throughout this paper, we will assume that the adversary knows the total number of tuples in the database (i.e., the set of possible instances is In). Hence, we can limit ourselves to considering changes in tuples (and not additions or deletions). We can in principle relax this assumption about cardinality by adding an additional set of secrets of the form s^i_⊥, which mean "individual i is not in the dataset". All of our definitions and algorithms can be modified to handle this case by adding ⊥ to the domain and to the discriminative secret graph G. We defer these extensions to future work.

3.2 Auxiliary Knowledge

Recent work [12] showed that differentially private mechanisms could still lead to an inordinate disclosure of sensitive information when adversaries have access to publicly known constraints about the data that induce correlations across tuples. This can be illustrated by the following example. Consider a table D with one attribute R that takes values r1, ..., rk. Suppose, based on publicly released datasets, the following k−1 constraints are already known: c(r1) + c(r2) = a1, c(r2) + c(r3) = a2, and so on, where c(ri) is the number of records with value ri. This does not provide enough information to always reconstruct the counts in D (k unknowns but k−1 linear equations). However, if we knew the answer to some c(ri), then all counts could be reconstructed – in this way tuples are correlated.

Differential privacy allows answering all the count queries c(ri) by adding independent noise with variance 2/ε² to each count. While these noisy counts c̃(ri) themselves do not disclose information about any individual, they can be combined with the constraints to get very precise estimates of c(ri). That is, we can construct k independent estimators for each count as follows. For r1, the estimators c̃(r1), a1 − c̃(r2), a1 − a2 + c̃(r3), ... each equal c(r1) in expectation and have a variance of 2/ε². By averaging these estimators, we can predict the value of c(r1) with a variance of 2/(kε²). For large k (e.g., when there are 2^d values in R), the variance is small enough that the table D is reconstructed with very high probability, thus causing a complete breach of privacy.

Therefore, our policy specification also takes into account auxiliary knowledge that an adversary might know about the individuals in the private database. In Blowfish, we consider knowledge in the form of a set of deterministic constraints that are publicly known about the dataset. We believe these are easier for data publishers to specify than probabilistic correlation functions. The effect of the constraints in Q is to make only a subset of the possible database instances IQ ⊂ In possible; or equivalently, all instances in In \ IQ are impossible. For any database D ∈ In, we write D ⊢ Q if D satisfies the constraints in Q; i.e., D ∈ IQ. Examples of deterministic constraints include:

• Count Query Constraints: A count query on a database returns the number of tuples that satisfy a certain predicate. A count query constraint is a set of (count query, answer) pairs over the database that are publicly known.

• Marginal Constraints: A marginal is a projection of the database on a subset of attributes, and each row counts the number of tuples that agree on that subset of attributes. Auxiliary knowledge of marginals means these database marginals are known to the adversary.
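The averaging attack above is easy to reproduce numerically. The sketch below is our code (variable names are ours): it simulates the k unbiased estimators of c(r1) implied by the pairwise-sum constraints and checks that their average has variance about 2/(kε²).

```python
import numpy as np

rng = np.random.default_rng(0)
k, epsilon = 64, 0.5
counts = rng.integers(50, 150, size=k).astype(float)  # true counts c(r_1),...,c(r_k)
a = counts[:-1] + counts[1:]        # public constraints: a_i = c(r_i) + c(r_{i+1})

def reconstruct_c1(noisy):
    """Average the k unbiased estimators of c(r_1):
    c~(r_1), a_1 - c~(r_2), a_1 - a_2 + c~(r_3), ..."""
    signs = (-1.0) ** np.arange(k)                               # +1, -1, +1, ...
    prefix = np.concatenate(([0.0], np.cumsum(signs[:-1] * a)))  # alternating sums of a_j
    return float(np.mean(prefix + signs * noisy))

# Noise with variance 2/eps^2 per count (Laplace with scale 1/eps), as in the text.
trials = [reconstruct_c1(counts + rng.laplace(scale=1 / epsilon, size=k))
          for _ in range(2000)]
print(np.var(trials), 2 / (k * epsilon ** 2))  # empirical variance ~ 2/(k eps^2)
```
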

3.3 Policy

Definition 3.1 (Policy). A policy is a triple P = (T, G, IQ), where G = (V, E) is a discriminative secret graph with V ⊆ T. In P, the set of discriminative pairs S^G_pairs is defined as the set {(s^i_x, s^i_y) | ∀i ∈ id, ∀(x, y) ∈ E}, where s^i_x denotes the statement t.id = i ∧ t = x. IQ denotes the set of databases that are possible under the constraints Q that are known about the database.

Note that the description of the policy can be exponential in the size of the input dataset. We will use shorthand to describe certain types of sensitive information (e.g., full domain, partition, etc.), and specify the set of possible databases IQ using the description of Q.

4. BLOWFISH PRIVACY

In this section, we present our new privacy definition, called Blowfish privacy. Like differential privacy, Blowfish uses the notion of neighboring datasets. The difference is that the set of neighbors in Blowfish depends on the policy P – both on the set of discriminative pairs and on the constraints known about the database.

Definition 4.1 (Neighbors). Let P = (T, G, IQ) be a policy. For any pair of datasets D1, D2, let T(D1, D2) ⊆ S^G_pairs be the set of discriminative pairs (s^i_x, s^i_y) such that the i-th tuples in D1 and D2 are x and y, resp. Let Δ(D1, D2) = (D1 \ D2) ∪ (D2 \ D1). D1 and D2 are neighbors with respect to a policy P, denoted by (D1, D2) ∈ N(P), if:
1. D1, D2 ∈ IQ (i.e., both the datasets satisfy Q).
2. T(D1, D2) ≠ ∅ (i.e., ∃(s^i_x, s^i_y) ∈ S^G_pairs such that the i-th tuples in D1 and D2 are x and y, resp.).
3. There is no database D3 ⊢ Q such that (a) T(D1, D3) ⊂ T(D1, D2), or (b) T(D1, D3) = T(D1, D2) and Δ(D3, D1) ⊂ Δ(D2, D1).

When P = (T, G, In) (i.e., no constraints), D1 and D2 are neighbors if some individual tuple's value is changed from x to y, where (x, y) is an edge in G. Note that T(D1, D2) is non-empty and has the smallest size (of 1). Neighboring datasets in differential privacy correspond to neighbors when G is a complete graph.

For policies having constraints, conditions 1 and 2 ensure that neighbors satisfy the constraints (i.e., are in IQ), and that they differ in at least one discriminative pair of secrets. Condition 3 ensures that D1 and D2 are minimally different in terms of discriminative pairs and tuple changes.

Definition 4.2 (Blowfish Privacy). Let ε > 0 be a real number and P = (T, G, IQ) be a policy. A randomized mechanism M satisfies (ε, P)-Blowfish privacy if for every pair of neighboring databases (D1, D2) ∈ N(P), and every set of outputs S ⊆ range(M), we have

    Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S].    (8)

Note that Blowfish privacy takes in the policy P in addition to ε as an input, and differs from differential privacy only in the set of neighboring databases N(P). For P = (T, G, In) (i.e., no constraints), it is easy to check that for any two databases that arbitrarily differ in one tuple (D1 = D ∪ {x}, D2 = D ∪ {y}), and any set of outputs S,

    Pr[M(D1) ∈ S] ≤ e^{ε·d_G(x,y)} · Pr[M(D2) ∈ S],    (9)

where d_G(x, y) is the shortest distance between x and y in G. This implies that an attacker may better distinguish pairs of points that are farther apart in the graph (e.g., values with many differing attributes in S^attr_pairs) than those that are closer. Similarly, an attacker can distinguish between x and y with probability 1 when x and y appear in different partitions under partitioned sensitive information S^P_pairs (d_G(x, y) = ∞).

4.1 Composition

Composition [8] is an important property that any privacy notion should satisfy in order to be able to reason about independent data releases. Sequential composition ensures that a sequence of computations that each ensure privacy in isolation also ensures privacy; this allows breaking down computations into smaller building blocks. Parallel composition is crucial to ensure that too much error is not introduced in computations occurring on disjoint subsets of data. We can show that Blowfish satisfies sequential composition and a weak form of parallel composition.

Theorem 4.1 (Sequential Composition). Let P = (T, G, IQ) be a policy and D ∈ IQ be an input database. Let M1(·) and M2(·, ·) be algorithms with independent sources of randomness that satisfy (ε1, P)- and (ε2, P)-Blowfish privacy, resp. Then an algorithm that outputs both M1(D) = ω1 and M2(ω1, D) = ω2 satisfies (ε1 + ε2, P)-Blowfish privacy.

Proof. See Appendix B.

Theorem 4.2 (Parallel Composition with Cardinality Constraint). Let P = (T, G, In) be a policy where the cardinality of the input D ∈ In is known. Let S1, ..., Sp be disjoint subsets of ids; D ∩ Si denotes the dataset restricted to the individuals in Si. Let Mi be mechanisms that each ensure (εi, P)-Blowfish privacy. Then the sequence of Mi(D ∩ Si) ensures (max_i εi, P)-Blowfish privacy.

Proof. See Appendix C.
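Eqn. (9) says that, absent constraints, the effective budget separating two values x and y is ε · d_G(x, y). The small sketch below is ours (it assumes an undirected edge list for G) and computes this multiplier by breadth-first search:

```python
from collections import deque

def privacy_loss_multiplier(edges, x, y, epsilon):
    """BFS distance d_G(x, y) in the discriminative secret graph; by Eqn. (9),
    distinguishing x from y is governed by an effective budget epsilon * d_G(x, y)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist, frontier = {x: 0}, deque([x])
    while frontier:
        u = frontier.popleft()
        if u == y:
            return epsilon * dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return float("inf")  # different components, e.g. two partitions of G^P

# Line graph over ages 0..100 (distance-threshold policy with theta = 1):
edges = [(i, i + 1) for i in range(100)]
print(privacy_loss_multiplier(edges, 30, 35, epsilon=0.1))  # 0.5 = eps * 5
```
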

Reasoning about parallel composition in the presence of general constraints is non-trivial. Consider two neighboring datasets Da, Db ∈ N(P). For instance, suppose one of the attributes is gender, we know the number of males and females in the dataset, and we are considering full domain sensitive information. Then there exist neighboring datasets that differ in two tuples i and j that are alternately male and female in Da and Db. If i and j appear in different subsets S1 and S2 resp., then Da ∩ S1 ≠ Db ∩ S1 and Da ∩ S2 ≠ Db ∩ S2. Thus the sequence of Mi(D ∩ Si) does not ensure (max_i εi, P)-Blowfish privacy. We generalize this observation below.

Define a pair of secrets (s, s′) to be critical to a constraint q if there exist Ds, Ds′ such that T(Ds, Ds′) = (s, s′), and Ds ⊢ q but Ds′ ⊬ q. Let crit(q) denote the set of secret pairs that are critical to q. Next, consider disjoint subsets S1, ..., Sk of ids. We denote by SP(Si) the set of secret pairs that pertain to the ids in Si. We say that a constraint q affects D ∩ Si if crit(q) ∩ SP(Si) ≠ ∅. We can now state a sufficient condition for parallel composition.

Theorem 4.3 (Parallel Composition with General Constraints). Let P = (T, G, IQ) be a policy and S1, ..., Sp be disjoint subsets of ids. Let Mi be mechanisms that each ensure (εi, P)-Blowfish privacy. Then the sequence of Mi(D ∩ Si) ensures (max_i εi, P)-Blowfish privacy if there exist disjoint subsets of constraints Q1, ..., Qp ⊂ Q such that all the constraints in Qi only affect D ∩ Si.

Proof. See Appendix C.

We conclude this section with an example of parallel composition. Suppose G contains two disconnected components on nodes S and T \ S. The set of all secret pairs corresponds to pairs of values that come either from S or from T \ S. Suppose we know two count constraints qS and q_{T\S} that count the number of tuples with values in S and T \ S, respectively. It is easy to see that crit(qS) = crit(q_{T\S}) = ∅. Therefore, running an (ε, (T, G, {qS, q_{T\S}}))-Blowfish private mechanism on disjoint subsets results in no loss of privacy.

4.2 Relation to other definitions

In this section, we relate Blowfish privacy to existing notions of privacy. We discuss variants of differential privacy [6] (including restricted sensitivity [1]), the Pufferfish framework [13], privacy axioms [11], and a recent independent work on extending differential privacy with metrics [3].

Differential Privacy [6]: One can easily verify that a mechanism satisfies ε-differential privacy (Definition 2.1) if and only if it satisfies (ε, P)-Blowfish privacy, where P = (T, K, In), and K is the complete graph on the domain. Thus, Blowfish privacy is a generalization of differential privacy that allows a data curator to trade off privacy vs. utility by controlling the sensitive information G (instead of K) and the auxiliary knowledge IQ (instead of In) in the policy.

Pufferfish Framework [13, 14]: Blowfish borrows the sensitive information specification from Pufferfish. Pufferfish defines adversarial knowledge using a set of data generating distributions, while Blowfish instantiates the same using publicly known constraints. We can show formal relationships between Blowfish and Pufferfish instantiations.

Theorem 4.4. Let Spairs be the set of discriminative pairs corresponding to policy P = (T, G, In). Let D denote the set of all product distributions {pi(·)}i over n tuples, where pi(·) denotes a probability distribution for tuple i over T. Then a mechanism satisfies (ε, Spairs, D)-Pufferfish privacy if and only if it satisfies (ε, P)-Blowfish privacy.

Theorem 4.5. Consider a policy P = (T, G, IQ) corresponding to a set of constraints Q. Let Spairs be defined as in Theorem 4.4. Let DQ be the set of product distributions conditioned on the constraints in Q; i.e.,

    P[D = x1, ..., xk] ∝ Π_i pi(xi) if D ∈ IQ, and 0 otherwise.

A mechanism M that satisfies (ε, Spairs, DQ)-Pufferfish privacy also satisfies (ε, P)-Blowfish privacy.

Theorem 4.4 states that Blowfish policies without constraints are equivalent to Pufferfish instantiated using adversaries who believe tuples in D are independent (the proof follows from Theorem 6.1 of [14]). Theorem 4.5 states that when constraints are known, Blowfish is a necessary condition for any mechanism that satisfies a similar Pufferfish instantiation with constraints (we conjecture the sufficiency of Blowfish as well). Thus Blowfish privacy policies correspond to a subclass of privacy definitions that can be instantiated using Pufferfish.

Both Pufferfish and Blowfish aid the data publisher in customizing privacy definitions by carefully defining sensitive information and adversarial knowledge. However, Blowfish improves over Pufferfish in three key aspects. First, there are no general algorithms known for Pufferfish instantiations; in this paper, we present algorithms for various Blowfish policies. Thus, we can't compare Blowfish and Pufferfish experimentally. Second, all Blowfish privacy policies result in composable privacy definitions. This is not true for the Pufferfish framework. Finally, we believe Blowfish privacy is easier to understand and use than the Pufferfish framework for data publishers who are not privacy experts. (We have some initial anecdotal evidence of this fact working with statisticians from the US Census.) For instance, one needs to specify adversarial knowledge as sets of complex probability distributions in Pufferfish, while in Blowfish policies one only needs to specify conceptually simpler publicly known constraints.

Other Privacy Definitions: Kifer and Lin [11] stipulate that every "good" privacy definition should satisfy two axioms – transformation invariance and convexity. We can show that Blowfish privacy satisfies both these axioms.

Recent papers have extended differential privacy to handle constraints. Induced neighbor privacy [12, 13] extends the notion of neighbors such that neighboring databases satisfy the constraints and are minimally far apart (in terms of tuple changes). Blowfish extends this notion of induced neighbors to take into account discriminative pairs of secrets, and measures distance in terms of the set of differing discriminative pairs. Restricted sensitivity [1] extends the notion of sensitivity to account for constraints. In particular, the restricted sensitivity of a function f given a set of constraints Q, or RS_f(Q), is the maximum |f(D1) − f(D2)|/d(D1, D2) over all D1, D2 ∈ IQ. However, tuning noise to RS_f(Q) may not limit the ability of an attacker to learn sensitive information. For instance, if IQ = {0^n, 1^n}, then the restricted sensitivity of releasing the number of 1s is 1. Adding constant noise does not prevent the adversary from knowing whether the database was 0^n or 1^n.

A very recent independent work suggests extending differential privacy using a metric over all possible databases [3].
In particular, given a distance metric d over instances, they require an algorithm to ensure that P[M(X) ∈ S] ≤ e^{ε·d(X,Y)} · P[M(Y) ∈ S], for all sets of outputs S and all instances X and Y. Thus differential privacy corresponds to a specific distance measure – the Hamming distance. The sensitive information specification in Blowfish can also be thought of in terms of a distance metric over tuples. In addition, we present novel algorithms (the ordered mechanism) and allow incorporating knowledge of constraints. We defer a more detailed comparison to future work.

5. BLOWFISH WITHOUT CONSTRAINTS

Given any query f that outputs a vector of reals, we can define a policy specific sensitivity of f. The Laplace mechanism with noise calibrated to the policy specific sensitivity then ensures Blowfish privacy.

Definition 5.1 (Policy Specific Global Sensitivity). Given a policy P = (T, G, IQ), S(f, P) denotes the policy specific global sensitivity of a function f, defined as max_{(D1,D2) ∈ N(P)} ||f(D1) − f(D2)||_1.

Theorem 5.1. Let P = (T, G, IQ) be a policy. Given a function f : IQ → R^d, outputting f(D) + η ensures (ε, P)-Blowfish privacy if η ∈ R^d is a vector of independent random numbers drawn from Lap(S(f, P)/ε).

When policies do not have constraints (P = (T, G, In)), (ε, P)-Blowfish differs from ε-differential privacy only in the specification of sensitive information. Note that every pair (D1, D2) ∈ N(P) differs in only one tuple when P has no constraints. Therefore, the following result trivially holds.

Lemma 5.2. Any mechanism M that satisfies ε-differential privacy also satisfies (ε, (T, G, In))-Blowfish privacy for all discriminative secret graphs G.

The proof follows from the fact that ε-differential privacy is equivalent to (ε, (T, K, In))-Blowfish privacy, where K is the complete graph.

In many cases, we can do better in terms of utility than differentially private mechanisms. It is easy to see that S(f, P) is never larger than the global sensitivity S(f). Therefore, just using the Laplace mechanism with S(f, P) can provide better utility.

For instance, consider a linear sum query f_w = Σ_{i=1}^n w_i x_i, where w ∈ R^n is a weight vector and each value x_i ∈ T = [a, b]. For G^full, the policy specific sensitivity is (b − a) · (max_i w_i), the same as the global sensitivity. For G^{d,θ}, where d(x, y) = |x − y|, the policy specific sensitivity is θ · (max_i w_i), which can be much smaller than the global sensitivity when θ ≪ (b − a).

As a second example, suppose P is a partitioning of the domain. If the policy specifies sensitive information partitioned by P (i.e., G^P), then the policy specific sensitivity of h_P is 0. That is, the histogram of P, or of any coarser partitioning, can be released without any noise. We show more examples of improved utility under Blowfish policies in Sections 6 and 7.

However, for histogram queries, the policy specific sensitivity for most reasonable policies (with no constraints) is 2, the same as the global sensitivity (the one exception being partitioned sensitive information). Thus, Blowfish cannot significantly improve the accuracy of histogram queries.

Next, we present examples of two analysis tasks – k-means clustering (Section 6) and releasing cumulative histograms (Section 7) – for which we can design mechanisms for Blowfish policies without constraints that have more utility (less error) than mechanisms satisfying differential privacy. In k-means clustering we will see that using Blowfish policies helps reduce the sensitivity of intermediate queries on the data. In the case of the cumulative histogram workload, we can identify novel query answering strategies, given a Blowfish policy, that help reduce the error.
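As an illustration of Definition 5.1 for the linear sum query above, the following sketch (our code, not the paper's) contrasts the global sensitivity under G^full with the policy specific sensitivity under G^{d,θ}:

```python
import numpy as np

def linear_sum_sensitivity(w, a, b, theta=None):
    """S(f_w, P) for f_w(D) = sum_i w_i * x_i with x_i in [a, b].
    theta=None models G^full (one tuple may move across the whole range b - a);
    under G^{d,theta} with d(x, y) = |x - y|, a neighbor moves one tuple by <= theta."""
    span = (b - a) if theta is None else min(theta, b - a)
    return span * np.max(np.abs(w))

w = np.ones(1000)
print(linear_sum_sensitivity(w, a=0, b=100))            # 100.0: global sensitivity
print(linear_sum_sensitivity(w, a=0, b=100, theta=5))   # 5.0: Blowfish, theta = 5
```
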

6. K-MEANS CLUSTERING

K-means clustering is widely used in many applications such as classification and feature learning. It aims to cluster proximate data together and is formally defined below.

Definition 6.1 (K-means Clustering). Given a data set of n points (t1, ..., tn) ∈ T^n, k-means clustering aims to partition the points into k ≤ n clusters S = {S1, ..., Sk} in order to minimize

    Σ_{i=1}^k Σ_{tj ∈ Si} ||tj − μi||²,    (10)

where μi = (1/|Si|) Σ_{tj ∈ Si} tj, and ||x − y|| denotes the L2 distance.

The non-private version of k-means clustering initializes the means/centroids (μ1, ..., μk) (e.g., randomly) and updates them iteratively as follows: 1) assign each point to the nearest centroid; 2) recompute the centroid of each cluster; repeat until reaching some convergence criterion or a fixed number of iterations.

The first differentially private k-means clustering algorithm was proposed by Blum et al. [2] as SuLQ k-means. Observe that only two queries are required explicitly to compute the centroids: 1) the number of points in each new cluster, q_size = (|S1|, ..., |Sk|), and 2) the sum of the data points in each cluster, q_sum = (Σ_{tj ∈ S1} tj, ..., Σ_{tj ∈ Sk} tj). The sensitivity of q_size is 2 (same as a histogram query). Let d(T) denote the diameter of the domain, or the largest L1 distance ||x − y||_1 between any two points x, y ∈ T. The sensitivity of q_sum could be as large as the diameter 2 · d(T), since a tuple changing from x to y can change the sums of two clusters by at most d(T) each.

Under Blowfish privacy policies, the policy specific sensitivity of q_sum can be much smaller than under differential privacy (i.e., the complete graph G^full for Blowfish policies). Since q_size is a histogram query, its sensitivity under Blowfish is also 2.

Lemma 6.1. The policy specific global sensitivities of q_sum under the attribute (G^attr), L1-distance (G^{L1,θ}), and partition (G^P) discriminative graphs (from Section 3) are smaller than the global sensitivity of q_sum under differential privacy.

Proof. First, in the attribute discriminative graph G^attr, edges correspond to pairs (x, y) ∈ T × T that differ in only one attribute. Thus, if |A| denotes the maximum distance between two elements in attribute A, then the policy specific sensitivity of q_sum under G^attr is max_A (2 · |A|) < 2 · d(T). Next, suppose we use G^{L1,θ}, where x, y ∈ T are connected by an edge if ||x − y||_1 ≤ θ. Then the policy specific sensitivity of q_sum is 2θ. Finally, consider the policy specified using the partitioned sensitive graph G^P, where P = {P1, P2, ..., Pk} is some data independent partitioning of the domain T. Here, an adversary should not distinguish between an individual's tuple taking a pair of values x, y ∈ T only if x and y appear in the same partition Pi for some i. Under this policy the sensitivity of q_sum is at most max_{P ∈ P} 2 · d(P) < 2 · d(T).

Thus, by Theorem 5.1, we can use the SuLQ k-means mechanism with the appropriate policy specific sensitivity for q_sum (from Lemma 6.1), and thus satisfy privacy under the Blowfish policy while ensuring better accuracy.
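The sketch below shows how the SuLQ-style iteration changes under a Blowfish policy: only the noise scale for q_sum is replaced by the policy specific sensitivity (e.g., 2θ under G^{L1,θ}, by Lemma 6.1). This is our simplified rendering – L1 assignment and a uniform budget split across iterations and between the two queries – not the exact experimental code:

```python
import numpy as np

def blowfish_kmeans(points, k, epsilon, s_sum, iters=10, seed=0):
    """Private k-means in the style of SuLQ [2]: perturb q_size (sensitivity 2)
    and q_sum (policy specific sensitivity s_sum) with Laplace noise."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n, dim = points.shape
    centroids = points[rng.choice(n, size=k, replace=False)]
    eps_q = epsilon / iters / 2   # sequential composition: 2 queries per iteration
    for _ in range(iters):
        # Assign each point to the nearest centroid under L1 distance.
        labels = np.argmin(np.abs(points[:, None, :] - centroids).sum(-1), axis=1)
        for j in range(k):
            members = points[labels == j]
            size = len(members) + rng.laplace(scale=2 / eps_q)
            sums = members.sum(axis=0) if len(members) else np.zeros(dim)
            sums = sums + rng.laplace(scale=s_sum / eps_q, size=dim)
            if size > 1:
                centroids[j] = sums / size
    return centroids

# e.g., under G^{L1,theta} with theta = 0.5 on a unit hypercube: s_sum = 2 * 0.5
```
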

[Figure 1: K-Means: Error under the Laplace mechanism vs. Blowfish privacy for different discriminative graphs. Panels (error vs. ε, with one curve per policy): (a) twitter, G^{L1,θ}; (b) skin segmentation, G^{L1,θ}; (c) synthetic data set, G^{L1,θ}; (d) skin segmentation, G^{L1,θ}, ratio Objective(Laplace)/Objective(Blowfish|128); (e) all datasets, G^attr; (f) twitter, G^P.]

6.1 Empirical Evaluation

We empirically evaluate the accuracy of k-means clustering under (ε, (T, G, In))-Blowfish privacy on three data sets. The first two datasets are real-world datasets – twitter and skin segmentation (http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation). The twitter data set consists of a total of 193563 tweets collected using the Twitter API that all contained a latitude/longitude within a bounding box of 50N, 125W and 30N, 110W (western USA) – about 2222 × 1442 km. By setting the precision of the latitude/longitude coordinates to 0.05, we obtain a 2D domain of size 400 × 300. The skin segmentation data set consists of 245057 instances. Three ordinal attributes are considered: the B, G, R values from face images of different classes, each with a range from 0 to 255. To understand the effect of Blowfish policies on datasets of different sizes, we consider the full dataset skin, as well as 10% and 1% sub-samples (skin10, skin01) of the data. The third dataset is a synthetic dataset where we generate 1000 points from (0, 1)^4 with k randomly chosen centers and Gaussian noise with σ = 0.2 in each direction.

In Figures 1(a)-1(d), we report the ratio of the mean of the objective value in Eqn. (10) between private clustering methods – including the Laplace mechanism and Blowfish privacy with S^{d,θ}_pairs – and the non-private k-means algorithm, for various values of ε = {0.1, 0.2, ..., 0.9, 1.0}. For all datasets, d(·) is the L1 (or Manhattan) distance. The number of iterations is fixed to 10 and the number of clusters is k = 4. Each experiment is repeated 50 times to find the mean and the lower and upper quartiles. Figure 1(a) clusters according to the latitude/longitude of each tweet. We consider five different policies: G^full (Laplace mechanism), G^{d,2000km}, G^{d,1000km}, G^{d,500km}, G^{d,100km}. Here, θ = 100km means that the adversary cannot distinguish locations within a 20000 square km region. Figure 1(b) clusters the 1% subsample skin01 based on the three attributes (B, G, R values) and also considers 5 policies: G^full, G^{d,256}, G^{d,128}, G^{d,64} and G^{d,32}. Lastly, we also consider five policies for the synthetic dataset in Figure 1(c): G^full, G^{d,1.0}, G^{d,0.5}, G^{d,0.25}, G^{d,0.1}.

From Figures 1(a)-1(c), we observe that the objective value of the Laplace mechanism could deviate up to 100 times away from the non-private method, but under Blowfish policies objective values could be less than 5 times that of non-private k-means. Moreover, the error introduced by the Laplace mechanism becomes larger with higher dimensionality – the ratio for the Laplace mechanism in Figures 1(c) and 1(b) (4- and 3-dimensional, resp.) is much higher than that in the 2D twitter dataset in Figure 1(a). From Figure 1(b), we observe that the error introduced by private mechanisms does not necessarily reduce monotonically as we reduce Blowfish privacy protection (i.e., reduce θ). The same pattern is observed in Figures 1(a) and 1(c). One possible explanation is that adding a sufficient amount of noise could be helpful to get out of local minima for clustering, but adding too much noise could lead to less accurate results.

To study the interplay between dataset size and Blowfish, we plot (Figure 1(d)) for skin, skin10 and skin01 the ratio of the objective value attained by the Laplace method to the objective value attained by one of the Blowfish policies, G^{d,128}. In all cases, we see an improvement in the objective under Blowfish. The improvement in the objective is smaller for larger ε and larger datasets (since the Laplace mechanism solution is close to the non-private solution on skin).

Finally, Figures 1(e) and 1(f) summarize our results on the G^attr and G^P discriminative graphs. Figure 1(e) shows that under the G^attr Blowfish policy, the error decreases by an order of magnitude compared to the Laplace mechanism for skin01 and the synthetic dataset, due to higher dimensionality and small dataset size. On the other hand, there is little gain from using G^attr for the larger 2D twitter dataset. Figure 1(f) shows the ratio of the objective attained by the private methods to that of the non-private k-means under G^P, for partitions P of different sizes. In each case, the 300x400 grid is uniformly divided; e.g., in partition|100, we consider a uniform partitioning into 100 coarse cells, where each new cell contains 30x40 cells from the original grid. Thus an adversary will not be able to tell whether an individual's location was within the area spanned by the 30x40 cells (about 36,300 sq km). partition|120000 corresponds to the original grid; thus we only protect pairs of locations within each cell of the original grid (about 30 sq km). We see that the objective values for Blowfish policies are smaller than the objective values under the Laplace mechanism, suggesting more accurate clustering. We also note that under partition|120000, we can do the clustering exactly, since the sensitivities of both q_size and q_sum are 0.

To summarize, Blowfish policies allow us to effectively improve utility by trading off privacy. In certain cases, we observe that Blowfish policies attain an objective value that is close to 10 times smaller than that of the Laplace mechanism. The gap between Laplace and Blowfish policies increases with dimensionality, and reduces with data size.

7. CUMULATIVE HISTOGRAMS

In this section, we develop novel query answering strategies for two workloads – cumulative histograms and range queries. Throughout this section, we use mean squared error as the measure of accuracy/error, as defined in Def 2.4.

Definition 7.1 (Cumulative Histogram). Consider a domain T = {x1, ..., x_|T|} that has a total ordering x1 ≤ ... ≤ x_|T|. Let c(xi) denote the number of times xi appears in the database D. Then, the cumulative histogram of T, denoted by S_T(·), is the sequence of cumulative counts

    { s_i | s_i = Σ_{j=1}^{i} c(x_j), ∀i = 1, ..., |T| }.    (11)

Since we know the total size of the dataset, |D| = n, dividing each cumulative count in S_T(·) by n gives us the cumulative distribution function (CDF) over T. Releasing the CDF has many applications, including computing quantiles and histograms, answering range queries and constructing indexes (e.g., k-d trees). This motivates us to design a mechanism for releasing cumulative histograms.

The cumulative histogram has a global sensitivity of |T| − 1, because all the counts in the cumulative histogram except s_|T| are reduced by 1 when a record in D changes from x1 to x_|T|. Similar to k-means clustering, we can reduce the sensitivity of the cumulative histogram S_T(·) by specifying the sensitive information, such as S^P_pairs or S^{d,θ}_pairs. For this section, we focus on S^{d,θ}_pairs, where d(·) is the L1 distance on the domain, and we assume that all the domains discussed here have a total ordering.

7.1 Ordered Mechanism

Let us first consider a policy Pθ = (T, G^{d,θ}, In) with θ = 1. The discriminative secret graph is a line graph, G^{d,1} = (V, E), where V = T and E = {(xi, x_{i+1}) | ∀i = 1, ..., |T| − 1}. This means that only adjacent domain values (xi, x_{i+1}) ∈ T × T can form a secret pair. Therefore, the policy specific sensitivity of S_T(·) for a line graph is 1. Based on this small sensitivity, we propose a mechanism, named the Ordered Mechanism M^O_{G^{d,1}}, to perturb the cumulative histogram S_T(·) over the line graph G^{d,1} in the following way. For each s_i, we add η_i ∼ Laplace(1/ε) to get s̃_i. Each s̃_i has an expected error of 2/ε². Note that Theorem 5.1 already ensures that releasing the s̃_i's satisfies (ε, P1)-Blowfish privacy.

Furthermore, observe that the counts in S_T(·) are in ascending order. Hence, we can boost the accuracy of S̃_T(·) using the constrained inference technique proposed by Hay et al. in [9]. In this way, the new cumulative histogram, denoted by Ŝ_T(·), satisfies the ordering constraint and has an error E_Ŝ = O(p log³|T| / ε²), where p represents the number of distinct values in S_T(·) [9]. Note that if we additionally enforce the constraint that s1 > 0, then all the counts are also positive. In particular, when p = 1, E_Ŝ = O(log³|T| / ε²), and when p = |T|, E_Ŝ = O(|T| / ε²). Many real datasets are sparse, i.e., the majority of the domain values have zero counts, and hence have few distinct cumulative counts, i.e., p ≪ |T|. This leads to a much smaller E_Ŝ compared to E_S̃. The best known strategy for releasing the cumulative histogram under differential privacy is the hierarchical mechanism [9], which results in a total error of O(|T| log³|T| / ε²). Moreover, the SVD bound [16] suggests that no differentially private strategy can release the cumulative histogram with O(|T|/ε²) error. Thus, under the line graph policy, the Ordered Mechanism is a much better strategy for the cumulative histogram.

One important application of the cumulative histogram is answering range queries, defined as follows.

Definition 7.2 (Range Query). Let D have domain T = {x1, ..., x_|T|}, where T has a total ordering. A range query, denoted by q[xi, xj], counts the number of tuples falling within the range [xi, xj], where xi, xj ∈ T and xi ≤ xj.

Range queries can be directly answered using the cumulative histogram Ŝ_T(·): q[xi, xj] = ŝ_j − ŝ_{i−1}. As each range query requires at most two noisy cumulative counts, it has an error smaller than 2 · (2/ε²) (even without constrained inference). Hence, we have the following theorem.

Theorem 7.1. Consider a policy (T, G^{d,1}, In), where G^{d,1} is a line graph. Then the expected error of a range query q[xi, xj] for the Ordered Mechanism is:

    E_{q[xi,xj], M^O_{G^{d,1}}} ≤ 4/ε².    (12)

This error bound is independent of |T| – much lower than the expected error when using a hierarchical structure with the Laplace mechanism to answer range queries, E_{q[xi,xj], lap} = O(log³|T| / ε²). Again, the SVD bound [16] suggests that no differentially private strategy can answer each range query with O(1/ε²) error. Other applications of S_T(·), including computing quantiles and histograms and constructing indexes (e.g., k-d trees), could also use the cumulative histogram in a similar manner to range queries, obtaining a much smaller error by trading utility with privacy under G^{d,1}. Next, we describe the ordered hierarchical mechanism, which works for general graphs G^{d,θ}.
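A compact sketch of the Ordered Mechanism for G^{d,1} (our code): noise the cumulative counts with Laplace(1/ε), then apply ordering-constrained inference – here rendered as a pool-adjacent-violators isotonic fit standing in for the technique of [9]:

```python
import numpy as np

def isotonic(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    vals, weights = [], []
    for v in y:
        vals.append(float(v)); weights.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = weights[-2] + weights[-1]
            vals[-2] = (weights[-2] * vals[-2] + weights[-1] * vals[-1]) / w
            weights[-2] = w
            vals.pop(); weights.pop()
    return np.repeat(vals, weights)

def ordered_mechanism(hist, epsilon, rng):
    """Release the cumulative histogram S_T(.) under the line-graph policy G^{d,1}:
    policy specific sensitivity 1, so Laplace(1/epsilon) noise per cumulative count."""
    s = np.cumsum(hist)
    s_noisy = s + rng.laplace(scale=1 / epsilon, size=s.size)
    return isotonic(s_noisy)             # enforce s_1 <= s_2 <= ... <= s_|T|

def range_query(s_hat, i, j):
    """q[x_i, x_j] = s_j - s_{i-1} (1-indexed); expected error <= 4/eps^2 (Thm 7.1)."""
    return s_hat[j - 1] - (s_hat[i - 2] if i > 1 else 0.0)

rng = np.random.default_rng(0)
hist = np.bincount(rng.integers(0, 50, size=1000), minlength=50)
s_hat = ordered_mechanism(hist, epsilon=0.5, rng=rng)
print(range_query(s_hat, 10, 20), hist[9:20].sum())   # noisy vs. true count
```
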
7.2 Ordered Hierarchical Mechanism

For a more general graph G^{d,θ} = (V, E), where V = T and E = {(xi, x_{i±1}), ..., (xi, x_{i±θ}) | ∀i = 1, ..., |T|}, the sensitivity of releasing the cumulative histogram S_T(·) becomes θ. The Ordered Mechanism would add noise from Lap(θ/ε) to each cumulative count s_i. The total error in the released cumulative histogram and range queries would still be asymptotically smaller than the error achieved by any differentially private mechanism for small θ. However, the errors become comparable as θ reaches log|T|, and the Ordered Mechanism's error exceeds the error of the hierarchical mechanism when θ = O(log^{3/2}|T|). In this section, we present a hybrid strategy for releasing cumulative histograms (and hence range queries), called the Ordered Hierarchical Mechanism, that always has an error less than or equal to that of the hierarchical mechanism, for all θ.

[Figure 2: Ordered Hierarchical Mechanism. 2(a) gives an example of the OH tree, where θ = 4. 2(b) and 2(c) show privacy-utility trade-offs for range queries, using G^{d,θ} for sensitive information: (b) Adult – capital loss, error vs. ε for θ ∈ {full domain, 1000, 500, 100, 50, 10, 1}; (c) Twitter – latitude, error vs. ε for θ ∈ {full domain, 500km, 50km, 5km}.]

Various hierarchical methods have been proposed in the literature [9, 19, 15, 20, 18]. A basic hierarchical structure is usually described as a tree with a regular fan-out f. The root records the total size of the dataset D, i.e., the answer to the range query q[x1, x_|T|]. This range is then partitioned into f intervals: if δ = ⌈|T|/f⌉, the intervals are [x1, xδ], [x_{δ+1}, x_{2δ}], ..., [x_{|T|−δ+1}, x_|T|], and the answers to the range queries over those intervals are recorded by the children of the root node. Recursively, the interval represented by the current node is further divided into f subintervals. Leaf nodes correspond to unit-length intervals q[xi, xi]. The height of the tree is h = ⌈log_f |T|⌉. In the above construction, the counts at level i are released using the Laplace mechanism with parameter εi, where Σ_i εi = ε. Prior work has considered distributing the ε uniformly or geometrically [5]. We use uniform budgeting in our experiments.

Inspired by the ordered mechanism for the line graph, we propose a hybrid structure, called the Ordered Hierarchical Structure OH, for (ε, (T, G^{d,θ}, In))-Blowfish privacy. As shown in Figure 2(a), OH has two types of nodes, S nodes and H nodes. The number of S nodes is k = ⌈|T|/θ⌉, which depends on the threshold θ. In this way, we can guarantee a sensitivity of 1 among the S nodes. Let us represent the S nodes as s1, ..., sk, where s1 = q[x1, xθ], ..., s_{k−1} = q[x1, x_{(k−1)θ}], and sk = q[x1, x_|T|]. Note that the si nodes here are not the same as the counts in the cumulative histogram, so we use the range query q[x1, xi] to represent the count si in a cumulative histogram. The first S node, s1, is the root of a subtree consisting of H nodes. This subtree is denoted by H1 and is used for answering all possible range queries within the interval [x1, xθ]. For all 1 < i ≤ k, si has two children: s_{i−1} and the root of a subtree made of H nodes, denoted by Hi, which similarly has a fan-out of f and represents counts for values in [(i−1)θ + 1, iθ]. We denote the height of the subtrees by h = ⌈log_f θ⌉. Using this hybrid structure, we can release cumulative counts as q[x1, xj] = q[x1, x_{lθ}] + q[x_{lθ+1}, xj], where lθ ≤ j < (l+1)θ. Here q[x1, x_{lθ}] is answered using sl, and q[x_{lθ+1}, xj] is answered using Hl. Then any range query can be answered as q[xi, xj] = q[x1, xj] − q[x1, x_{i−1}].

Privacy Budgeting Strategy: Given a total privacy budget ε, we denote the privacy budget assigned to all the S nodes by εS and that assigned to all the H nodes by εH. When a tuple changes its value from x to y, where d_G(x, y) ≤ θ, at most one S node changes its count value and at most 2h H nodes change their count values. Hence, for i = 2, ..., k, we add Laplace noise drawn from Lap(1/εS) to each si, and we add Laplace noise drawn from Lap(2h/εH) to each H node in the subtree Hi. As s1 is the root of H1, we assign ε = εS + εH to the tree H1, and hence we add Laplace noise drawn from Lap(2h/(εH + εS)) to each H node in H1, including s1. In this way, we can claim that this OH tree satisfies (ε, (T, G^{d,θ}, In))-Blowfish privacy. When θ = |T|, H1 is the only tree and receives all the privacy budget; this is equivalent to the hierarchical mechanism for differential privacy.

Theorem 7.2. Consider a policy P = (T, G^{d,θ}, In). (1) The Ordered Hierarchical structure satisfies (ε, P)-Blowfish privacy. (2) The expected error of releasing a single count in the cumulative histogram, or of answering a range query, with this structure over T is

    E_{q[xi,xj], M^{OH}_{G^{d,θ}}} = O( (|T| − θ) / (|T| · εS²) + (f − 1) log³_f θ / εH² ).    (13)

Proof (sketch). 1) Increasing the count of x by 1 and decreasing the count of y by 1, where d(x, y) ≤ θ, affects the counts of at most one S node and 2h H nodes. Since we draw noise from Lap(1/εS) for S nodes and from Lap(2h/εH) for H nodes (where ε = εS + εH), we get (ε, P)-Blowfish privacy based on sequential composition. 2) Consider all |T| counts in the cumulative histogram; only |T| − θ of them require S nodes, and each S node has an error of 2/εS². This gives the first term in Eqn. (13). On average, the number of H nodes used for each count in the cumulative histogram is bounded by the height of the H tree, and each H node has an error of 8h²/εH², which explains the second term in Eqn. (13).

Each range query q[xi, xj] requires at most 2 counts from the cumulative histogram, and its exact expected error is

    E_{q[xi,xj], M^{OH}_{G^{d,θ}}} = c1/εS² + c2/εH²,    (14)

where c1 = 4(|T| − θ)/(|T| + 1) and c2 = 8(f − 1) log³_f θ · |T|/(|T| + 1). Given θ and f, at ε*_S = c1^{1/3} · ε / (c1^{1/3} + c2^{1/3}), we obtain the minimum

    E*_{q[xi,xj], M^{OH}_{G^{d,θ}}} = (c1^{1/3} + c2^{1/3})³ / ε².    (15)

In particular, when θ = |T|, c1 = 0 and this is equivalent to the classical hierarchical mechanism without S nodes, with error O(log³|T| / ε²). When θ = 1, c2 = 0, this is the pure Ordered Mechanism without H nodes, with error O(1/ε²).
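The closed-form split of Eqn. (15) is straightforward to compute. A small sketch (our code; it treats the degenerate θ = 1 and θ = |T| cases explicitly):

```python
import math

def oh_budget_split(T, theta, f, epsilon):
    """Minimize Eqn. (14) over eps_S + eps_H = epsilon via Eqn. (15)."""
    c1 = 4 * (T - theta) / (T + 1)
    c2 = (8 * (f - 1) * math.log(theta, f) ** 3 * T / (T + 1)) if theta > 1 else 0.0
    if c2 == 0.0:               # theta = 1: pure ordered mechanism
        return epsilon, 0.0
    if c1 == 0.0:               # theta = |T|: pure hierarchical mechanism
        return 0.0, epsilon
    r = c1 ** (1 / 3) / (c1 ** (1 / 3) + c2 ** (1 / 3))
    return r * epsilon, (1 - r) * epsilon

eps_s, eps_h = oh_budget_split(T=4357, theta=100, f=16, epsilon=1.0)
print(eps_s, eps_h)
```
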

Complexity Analysis: The complexity of constructing the hybrid tree OH and of answering a range query are O(|T|) and O(log θ) respectively, where θ ≤ |T|, which is no worse than the classical hierarchical methods.

7.3 Empirical Evaluation

We empirically evaluate the error of range queries for the Ordered Hierarchical Mechanism with (ε, (T, G^{d,θ}, In))-Blowfish privacy on two real-world datasets – adult and twitter. The adult data set (http://mlr.cs.umass.edu/ml/datasets/Adult) consists of Census records of 48842 individuals. We consider the ordinal attribute capital loss, with a domain size of 4357. The twitter data set is the same dataset used for k-means clustering (Sec 6.1). Here, in order to have a total ordering for the dataset, we project the twitter data set on its latitude, with a domain size of 400, covering around 2222 km. The fan-out f is set to 16 and each experiment is repeated 50 times. Figure 2(b) shows the mean squared error E of 10000 random range queries for various values of ε = {0.1, 0.2, ..., 0.9, 1.0}. Seven threshold values θ = {full, 1000, 500, 100, 50, 10, 1} are considered. For adult, for example, θ = 100 means the adversary cannot distinguish between values of capital loss within a range of 100, and θ = full means the adversary cannot distinguish between any domain values (the same as differential privacy). Figure 2(c) considers 4 threshold values θ = {full, 500km, 50km, 5km} for twitter. When θ = 1 (adult) or θ = 5km (twitter), the ordered hierarchical mechanism is the same as the ordered mechanism. From both figures, we see that as θ decreases, E decreases, with orders of magnitude difference in error between θ = 1 and θ = |T|.

8. BLOWFISH WITH CONSTRAINTS

In this section, we consider query answering under Blowfish policies with constraints P = (T, G, I_Q), where I_Q ⊊ I_n. In the presence of general deterministic constraints Q, pairs of neighboring databases can differ in any number of tuples, depending on the structures of Q and the discriminative secret graph G. Computing the policy specific sensitivity in this general case is a hard problem, as shown next.

Theorem 8.1. Given a function f and a policy P = (T, G, I_Q), checking whether S(f, P) > 0 is NP-hard. The same is true for the complete histogram query h.

Proof. (sketch) The proof follows from the hardness of checking whether a 3SAT formula has at least 2 solutions. The reduction uses I = {0,1}^n, constraints corresponding to the clauses in the formula, and {s^i_0, s^i_1}_i as secret pairs.

Theorem 8.1 implies that checking whether S(f, P) ≤ z is co-NP-hard for general constraints Q. In fact, the hardness result holds even if we just consider the histogram query h and general count query constraints.

Hence, in the rest of this section, we focus on releasing histograms under a large subclass of constraints called sparse count query constraints. In Section 8.1 we show that when the count query constraint is "sparse", we can efficiently compute S(h, P), and thus we can use the Laplace mechanism to release the histogram. In Section 8.2, we show that our general result about S(h, P) subject to sparse count query constraints can be applied to several important practical scenarios.

8.1 Global Sensitivity for Sparse Constraints

A count query q_φ returns the number of tuples satisfying predicate φ in a database D, i.e., q_φ(D) = Σ_{t ∈ D} 1_{φ(t)=true}. The auxiliary knowledge we consider here is a count query constraint Q, which can be expressed as a conjunction of query-answer pairs:

    q_{φ_1}(D) = cnt_1 ∧ q_{φ_2}(D) = cnt_2 ∧ ... ∧ q_{φ_p}(D) = cnt_p.    (16)

Since the answers cnt_1, cnt_2, ..., cnt_p do not affect our analysis, we denote the auxiliary knowledge, or count query constraint, as Q = {q_{φ_1}, q_{φ_2}, ..., q_{φ_p}}. Note that this class of auxiliary knowledge is already very general and commonly seen in practice. For example, marginals of contingency tables, range queries, and degree distributions of graphs can all be expressed in this form.

Even for this class of constraints, calculating S(h, P) is still hard. In fact, the same hardness result as in Theorem 8.1 holds for count query constraints (using a reduction from the Vertex Cover problem).

8.1.1 Sparse Auxiliary Knowledge

Consider a secret pair (s^i_x, s^i_y) ∈ S^G_pairs about a tuple t with t.id = i, and a count query q_φ ∈ Q. If the tuple t ∈ D changes from x to y, there are three mutually exclusive cases for q_φ(D): i) it increases by one (¬φ(x) ∧ φ(y)); ii) it decreases by one (φ(x) ∧ ¬φ(y)); or iii) it stays the same (otherwise).

Definition 8.1 (Lift and Lower). A pair (x, y) ∈ T × T is said to lift a count query q_φ iff φ(x) = false ∧ φ(y) = true, and to lower q_φ iff φ(x) = true ∧ φ(y) = false.

Note that one pair may lift or lower many count queries simultaneously. We now define sparse auxiliary knowledge.

Definition 8.2 (Sparse Knowledge). The auxiliary knowledge Q = {q_{φ_1}, q_{φ_2}, ..., q_{φ_p}} is sparse w.r.t. the discriminative secret graph G = (V, E) iff each pair (x, y) ∈ E lifts at most one count query in Q and lowers at most one count query in Q.
Example 8.1 (Lift, Lower, and Sparse Knowledge). Consider databases from the domain T = A_1 × A_2 × A_3, where A_1 = {a_1, a_2}, A_2 = {b_1, b_2}, and A_3 = {c_1, c_2, c_3}, and the count query constraint Q = {q_1, q_2, q_3, q_4} of Figure 3(a). With full-domain sensitive information, any pair in T × T is a discriminative secret, and thus the discriminative secret graph G is a complete graph. The pair ((a_1, b_1, c_1), (a_2, b_2, c_2)) lifts q_4 and lowers q_1, while the pair ((a_1, b_2, c_1), (a_1, b_2, c_2)) neither lifts nor lowers any query. We can verify that every pair either (i) lifts exactly one query in Q and lowers exactly one query in Q, or (ii) lifts or lowers no query in Q. So Q is sparse w.r.t. the discriminative secret graph G.

[Figure 3: (a) the count query constraint Q = {q_1, q_2, q_3, q_4}, where q_1: t.A_1 = a_1 ∧ t.A_2 = b_1; q_2: t.A_1 = a_1 ∧ t.A_2 = b_2; q_3: t.A_1 = a_2 ∧ t.A_2 = b_1; q_4: t.A_1 = a_2 ∧ t.A_2 = b_2; (b) the policy graph G_P = (V_P, E_P) on the vertices {q_1, q_2, q_3, q_4, v+, v-}, for databases with three attributes A_1 = {a_1, a_2}, A_2 = {b_1, b_2}, and A_3 = {c_1, c_2, c_3}, subject to Q and full-domain sensitive information.]
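The lift/lower conditions and the sparsity test are mechanical, so a small self-contained Python check of Example 8.1 may help; the tuple encoding and function names below are our own.

```python
from itertools import product

# Domain and count query predicates of Example 8.1 / Figure 3(a).
A1, A2, A3 = ("a1", "a2"), ("b1", "b2"), ("c1", "c2", "c3")
domain = list(product(A1, A2, A3))
Q = {
    "q1": lambda t: t[0] == "a1" and t[1] == "b1",
    "q2": lambda t: t[0] == "a1" and t[1] == "b2",
    "q3": lambda t: t[0] == "a2" and t[1] == "b1",
    "q4": lambda t: t[0] == "a2" and t[1] == "b2",
}

def lifts(x, y, phi):   # Definition 8.1: phi(x) = false and phi(y) = true
    return not phi(x) and phi(y)

def lowers(x, y, phi):  # Definition 8.1: phi(x) = true and phi(y) = false
    return phi(x) and not phi(y)

def is_sparse(secret_pairs, Q):
    """Definition 8.2: every secret pair lifts at most one query in Q
    and lowers at most one query in Q."""
    return all(
        sum(lifts(x, y, phi) for phi in Q.values()) <= 1
        and sum(lowers(x, y, phi) for phi in Q.values()) <= 1
        for x, y in secret_pairs)

# Full-domain sensitive information: G is the complete graph on the domain.
secret_pairs = [(x, y) for x in domain for y in domain if x != y]
print(is_sparse(secret_pairs, Q))  # True: Q is sparse w.r.t. G
```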
We will show that when the auxiliary knowledge Q is sparse w.r.t. the discriminative secret graph G = (V, E) in a policy P = (T, G, I_Q), it is possible to analytically bound the policy specific global sensitivity S(h, P). To this end, we first construct a directed graph, called the policy graph G_P = (V_P, E_P), from P, with the count queries in Q forming the vertices and the relationships between count queries and secret pairs forming the edges.

Definition 8.3 (Policy Graph). Given a policy P = (T, G(V, E), I_Q) with a sparse count constraint Q, the policy graph G_P = (V_P, E_P) is a directed graph, where:
• V_P = Q ∪ {v+, v-}: create a vertex for each count query q_φ ∈ Q, and two additional special vertices v+ and v-.
• E_P: i) add a directed edge (q_φ, q_{φ'}) iff there exists a secret pair (x, y) ∈ E lifting q_{φ'} and lowering q_φ; ii) add a directed edge (v+, q_φ) iff there is a secret pair in E lifting q_φ but not lowering any other q_{φ'}; iii) add a directed edge (q_φ, v-) iff there is a secret pair in E lowering q_φ but not lifting any other q_{φ'}; and iv) add the edge (v+, v-).

Let α(G_P) denote the length (number of edges) of the longest simple cycle in G_P; α(G_P) is defined to be 0 if G_P has no directed cycle. Let ξ(G_P) be the length (number of edges) of a longest simple path from v+ to v- in G_P.

Example 8.2 (Policy Graph). Following Example 8.1, since the count query constraint Q is sparse w.r.t. the discriminative secret graph, we obtain the policy graph G_P = (V_P, E_P) shown in Figure 3(b). For example, the pair ((a_1, b_1, c_1), (a_2, b_2, c_2)) in T × T lifts q_4 and lowers q_1, so there is an edge (q_1, q_4). There is no edge from v+ or to v-, except (v+, v-), because every pair in T × T either lifts one query and lowers one, or lifts/lowers no query. In this policy graph G_P, we have α(G_P) = 4 and ξ(G_P) = 1.

Theorem 8.2. Let h be the complete histogram query. In a policy P = (T, G, I_Q), if the auxiliary knowledge Q is sparse w.r.t. G, then: (i) we have

    S(h, P) ≤ 2 max{α(G_P), ξ(G_P)};

and (ii) if there exist two neighboring databases (D_1, D_2) ∈ N(P) s.t. ||h(D_1) - h(D_2)||_1 = 2|T(D_1, D_2)| with |T(D_1, D_2)| = max_{(D', D'') ∈ N(P)} |T(D', D'')|, then we have the equality

    S(h, P) = 2 max{α(G_P), ξ(G_P)}.

Proof. See Appendix A.

For a count query constraint Q that is sparse with respect to the policy P = (T, G, I_Q), we have the following immediate corollary giving an upper bound.

Corollary 8.3. In a policy P = (T, G, I_Q), if Q is sparse w.r.t. G, then S(h, P) ≤ 2 max{|Q|, 1}.

Thus, drawing noise from Laplace(2 max{|Q|, 1}/ε) suffices (but may not be necessary) for releasing the complete histogram while ensuring (ε, P)-Blowfish privacy.
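Continuing the sketch above (and reusing its Q, secret_pairs, lifts and lowers), the following builds the policy graph of Definition 8.3 and computes α and ξ by brute-force enumeration of simple paths. This is exponential in general, consistent with the hardness discussed in Section 8.2, but it is instant on graphs as small as Figure 3(b).

```python
def policy_graph(Q, secret_pairs):
    """Definition 8.3: one vertex per query plus v+ and v-; edges follow
    which query each secret pair lifts and which it lowers."""
    edges = {("v+", "v-")}
    for x, y in secret_pairs:
        up = [n for n, phi in Q.items() if lifts(x, y, phi)]
        down = [n for n, phi in Q.items() if lowers(x, y, phi)]
        if up and down:                 # sparsity: at most one of each
            edges.add((down[0], up[0]))
        elif up:
            edges.add(("v+", up[0]))
        elif down:
            edges.add((down[0], "v-"))
    return edges

def _adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return adj

def xi(edges):
    """Length of the longest simple v+ -> v- path."""
    adj, best = _adjacency(edges), 0
    def dfs(u, seen, depth):
        nonlocal best
        if u == "v-":
            best = max(best, depth)
            return
        for v in adj.get(u, []):
            if v not in seen:
                dfs(v, seen | {v}, depth + 1)
    dfs("v+", {"v+"}, 0)
    return best

def alpha(edges):
    """Length of the longest simple directed cycle (0 if acyclic)."""
    adj, best = _adjacency(edges), 0
    def dfs(start, u, seen, depth):
        nonlocal best
        for v in adj.get(u, []):
            if v == start:
                best = max(best, depth + 1)
            elif v not in seen:
                dfs(start, v, seen | {v}, depth + 1)
    for s in {u for e in edges for u in e}:
        dfs(s, s, {s}, 0)
    return best

E_P = policy_graph(Q, secret_pairs)
a, x = alpha(E_P), xi(E_P)
print(a, x, 2 * max(a, x))  # Example 8.2: alpha = 4, xi = 1, bound = 8
```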
8.2 Applications

The problem of calculating α(G_P) and ξ(G_P) exactly in a general policy graph G_P is still a hard problem, but it becomes tractable in a number of practical scenarios. We give three such examples: i) the policy specific global sensitivity S(h, P) subject to auxiliary knowledge of one marginal, for full-domain sensitive information; ii) S(h, P) subject to auxiliary knowledge of multiple marginals, for attribute sensitive information; and iii) S(h, P) subject to auxiliary knowledge of range queries, for distance-threshold sensitive information.

8.2.1 Marginals and Full-domain Secrets

Marginals are also called cuboids in data cubes. Intuitively, in a marginal or cuboid C, we project the database of tuples onto a subset of attributes [C] ⊆ {A_1, A_2, ..., A_k} and count the number of tuples that have the same values on these attributes. Here, we consider the scenario where the adversaries have auxiliary knowledge about one or more marginals, i.e., the counts in some marginals are known.

Definition 8.4 (Marginal). Given a database D of n tuples from a k-dim domain T = A_1 × A_2 × ... × A_k, a d-dim marginal C is the (exact) answer to the query:

    SELECT A_{i_1}, A_{i_2}, ..., A_{i_d}, COUNT(*) FROM D
    GROUP BY A_{i_1}, A_{i_2}, ..., A_{i_d}

Let [C] denote the set of d attributes {A_{i_1}, A_{i_2}, ..., A_{i_d}}.

A marginal [C] = {A_{i_1}, A_{i_2}, ..., A_{i_d}} is essentially a set of count queries C^q = {q_φ}, where the predicate φ(t) := (t.A_{i_1} = a_{i_1}) ∧ (t.A_{i_2} = a_{i_2}) ∧ ... ∧ (t.A_{i_d} = a_{i_d}), for all possible (a_{i_1}, a_{i_2}, ..., a_{i_d}) ∈ A_{i_1} × A_{i_2} × ... × A_{i_d}.

Let A = {A_1, A_2, ..., A_k} be the set of all attributes. For a marginal C, define size(C) = Π_{A_i ∈ [C]} |A_i|, where |A_i| is the cardinality of attribute A_i. So size(C) is the number of possible rows in the marginal C, or equivalently the number of count queries in C^q as constructed above.

Suppose a marginal with [C] ⊊ A is known to the adversary. Let I_{Q(C)} denote the set of databases whose marginal C equals a certain value. Recall that we want to publish the complete histogram h of a database D from the domain T = A_1 × A_2 × ... × A_k. For the full-domain sensitive information, using Theorem 8.2, the global sensitivity equals 2 size(C).

Theorem 8.4. Let h be the complete histogram. For a policy P = (T, G, I_{Q(C)}), where G represents the full-domain sensitive information S^full_pairs and [C] ⊊ A is a marginal, we have S(h, P) = 2 size(C).

Proof. (sketch) Consider the set of count queries C^q. It is not hard to show that C^q is sparse w.r.t. the complete graph G, so we can construct a policy graph from (T, G, I_{C^q}); this policy graph is a complete graph with vertex set C^q. From Theorem 8.2, we have S(h, P) = 2|C^q| = 2 size(C). The upper bound S(h, P) ≤ 2|C^q| follows directly from Theorem 8.2, and it is not hard to construct two neighboring databases that match this upper bound, since [C] ⊊ A.

Example 8.3. Continuing with Example 8.2, note that the constraints in Figure 3(a) correspond to the marginal [C] = {A_1, A_2}. So from (i) in Theorem 8.2, we have S(h, P) ≤ 2 × 4 = 8. The worst case S(h, P) = 8 can be verified by considering the two neighboring databases D_1 and D_2, each with four rows: a_1b_1c_1 (in D_1)/a_1b_2c_2 (in D_2), a_1b_2c_1/a_2b_1c_2, a_2b_1c_1/a_2b_2c_2, and a_2b_2c_1/a_1b_1c_2.
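For a single known marginal, Theorem 8.4 reduces the sensitivity computation to counting the rows of the marginal. A small sketch on the domain of Example 8.3 (the attribute names and helpers are our own):

```python
from itertools import product

attrs = {"A1": ("a1", "a2"), "A2": ("b1", "b2"), "A3": ("c1", "c2", "c3")}

def marginal_queries(attrs, C):
    """The count queries C^q induced by a marginal on the attributes in C:
    one query per combination of values of those attributes."""
    names = sorted(C)
    return [dict(zip(names, cell))
            for cell in product(*(attrs[a] for a in names))]

def size_of(attrs, C):
    """size(C): product of the cardinalities of the attributes in C."""
    s = 1
    for a in C:
        s *= len(attrs[a])
    return s

C = {"A1", "A2"}  # the marginal of Example 8.3
assert len(marginal_queries(attrs, C)) == size_of(attrs, C) == 4
print("S(h, P) =", 2 * size_of(attrs, C))  # Theorem 8.4: 2 * size(C) = 8
```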

8.2.2 Marginals and Attribute Secrets

Now suppose a set of p marginals C_1, ..., C_p with [C_1], ..., [C_p] ⊊ {A_1, A_2, ..., A_k} is auxiliary knowledge to the adversary. Let I_{Q(C_1,...,C_p)} be the set of databases with these p marginals equal to certain values. For the attribute sensitive information, if the p marginals are disjoint, then using Theorem 8.2 the global sensitivity is 2 max_{1≤i≤p} size(C_i).

Theorem 8.5. Let h be the complete histogram. Consider a policy P = (T, G^attr, I_{Q(C_1,...,C_p)}), where [C_i] ⊊ A for every marginal C_i, and [C_i] ∩ [C_j] = ∅ for any two C_i and C_j. Then we have S(h, P) = 2 max_{1≤i≤p} size(C_i).

Proof. (sketch) Consider the set of count queries Q = C_1^q ∪ ... ∪ C_p^q; it is not hard to show that Q is sparse w.r.t. G^attr. The policy graph from (T, G^attr, Q) is the union of p cliques with vertex sets C_1^q, ..., C_p^q. From Theorem 8.2, we have S(h, P) = 2 max_i |C_i^q| = 2 max_i size(C_i). The upper bound S(h, P) ≤ 2 max_i |C_i^q| follows directly from Theorem 8.2, and it is not hard to construct two neighboring databases that match this upper bound, since [C_i] ≠ A.

8.2.3 Grid and Distance-threshold Secrets

Our general theorem about S(h, P) can also be applied to databases with geographical information.

Consider a domain T = [m]^k, where [m] = {1, 2, ..., m}. When k = 2 or 3, T can be used to approximately encode a 2-dim plane or a 3-dim space. For two points x, y ∈ T, we define the distance d(x, y) to be the L_p distance ||x - y||_p. For two point sets X, Y ⊂ T, we define d(X, Y) = min_{x ∈ X, y ∈ Y} d(x, y). A geographical database D consists of n points, each of which is drawn from the domain T and may represent the location of an object.

Define a rectangle R = [l_1, u_1] × [l_2, u_2] × ... × [l_k, u_k], where l_i ∈ [m], u_i ∈ [m], and l_i ≤ u_i. A range count query q_R returns the number of tuples whose locations fall into the rectangle R. R is called a point query if l_i = u_i for all i.

In this scenario, suppose the answers to a set of p range count queries are known to the adversary, so we can represent the auxiliary knowledge as Q = {q_{R_1}, q_{R_2}, ..., q_{R_p}}. Also, suppose we aim to protect the distance-threshold sensitive information S^{d,θ}_pairs = {(s^i_x, s^i_y) | d(x, y) ≤ θ} while publishing the complete histogram h.

Using Theorem 8.2, we can calculate the global sensitivity if all rectangles are disjoint, i.e., R_i ∩ R_j = ∅ for any i ≠ j, as follows. Construct a graph G_R(Q) = (V_R, E_R) on the set of rectangles in Q: i) create a vertex in V_R for the rectangle R_i in each range count query q_{R_i} ∈ Q; and ii) add an edge (R_i, R_j) to E_R iff d(R_i, R_j) ≤ θ. We can prove that the policy specific global sensitivity equals 2(maxcomp(Q) + 1) when there are no point query constraints, where maxcomp(Q) is the number of nodes in the largest connected component of G_R(Q). Note that maxcomp(Q), and hence S(h, P), can be computed efficiently.

Theorem 8.6. Let h be the complete histogram. Consider a policy P = (T, G, I_Q), where T = [m]^k, G represents the distance-threshold sensitive information S^{d,θ}_pairs (θ > 0), and Q is a set of disjoint range count queries {q_{R_1}, q_{R_2}, ..., q_{R_p}} with R_i ∩ R_j = ∅ for i ≠ j. We have S(h, P) ≤ 2(maxcomp(Q) + 1). If none of the constraints are point queries, then S(h, P) = 2(maxcomp(Q) + 1).
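Unlike α and ξ in general, maxcomp(Q) is cheap to compute. A minimal sketch (the rectangle encoding and names are our own) using union-find over the rectangle graph G_R(Q):

```python
import math
from itertools import combinations

def rect_distance(R1, R2):
    """L2 distance between axis-aligned rectangles, each a list of
    (l_i, u_i) intervals; 0 if they touch or overlap."""
    gaps = [max(l2 - u1, l1 - u2, 0)
            for (l1, u1), (l2, u2) in zip(R1, R2)]
    return math.sqrt(sum(g * g for g in gaps))

def maxcomp(rects, theta):
    """Size of the largest connected component of G_R(Q), where two
    rectangles are adjacent iff their distance is at most theta."""
    parent = list(range(len(rects)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(rects)), 2):
        if rect_distance(rects[i], rects[j]) <= theta:
            parent[find(i)] = find(j)
    sizes = {}
    for i in range(len(rects)):
        sizes[find(i)] = sizes.get(find(i), 0) + 1
    return max(sizes.values(), default=0)

# Three disjoint rectangles on a 2-dim grid; with theta = 3 the first two
# are within distance theta of each other and the third is isolated.
Q = [[(1, 2), (1, 2)], [(4, 5), (1, 2)], [(20, 21), (20, 21)]]
print(maxcomp(Q, theta=3))            # 2
print(2 * (maxcomp(Q, theta=3) + 1))  # Theorem 8.6: S(h, P) = 6
```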
9. CONCLUSIONS

We propose a new class of privacy definitions, called Blowfish privacy, with the goal of seeking better trade-offs between privacy and utility. The key feature of Blowfish is a policy, in which users can specify the sensitive information that needs to be protected and the knowledge about their databases that has been released to potential adversaries. Such a rich set of "tuning knobs" in the policy enables users to improve utility by customizing the sensitive information, and to limit attacks from adversaries with auxiliary knowledge. Using the examples of k-means clustering, cumulative histograms and range queries, we show how to tune utility using reasonable policies with weaker specifications of privacy; for the latter, we develop strategies that are more accurate than any differentially private mechanism. Moreover, we study how to calibrate noise for Blowfish policies with count constraints when publishing histograms, and the general result we obtain can be applied in several practical scenarios.

Acknowledgements: We would like to thank Jiangwei Pan and the anonymous reviewers for their comments. This work was supported by the National Science Foundation under Grant #1253327 and a gift from Google.

10. REFERENCES
[1] J. Blocki, A. Blum, A. Datta, and O. Sheffet. Differentially private data analysis of social networks via restricted sensitivity. In ACM ITCS, 2013.
[2] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In PODS, 2005.
[3] K. Chatzikokolakis, M. E. Andrés, N. Bordenabe, and C. Palamidessi. Broadening the scope of differential privacy using metrics. In Privacy Enhancing Technologies, 2013.
[4] B.-C. Chen, D. Kifer, K. LeFevre, and A. Machanavajjhala. Privacy-preserving data publishing. Foundations and Trends in Databases, 2(1-2):1–167, 2009.
[5] G. Cormode, C. M. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In ICDE, pages 20–31, 2012.
[6] C. Dwork. Differential privacy. In ICALP, 2006.
[7] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, pages 265–273, 2008.
[9] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. In PVLDB, pages 1021–1032, 2010.
[10] X. He, A. Machanavajjhala, and B. Ding. Blowfish privacy: Tuning privacy-utility trade-offs using policies. CoRR, abs/1312.3913, 2014.
[11] D. Kifer and B.-R. Lin. Towards an axiomatization of statistical privacy and utility. In PODS, 2010.
[12] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, pages 193–204, 2011.
[13] D. Kifer and A. Machanavajjhala. A rigorous and customizable framework for privacy. In PODS, 2012.
[14] D. Kifer and A. Machanavajjhala. Pufferfish: A framework for mathematical privacy definitions. To appear in ACM Transactions on Database Systems, 39(1), 2014.
[15] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In PODS, pages 123–134, 2010.
[16] C. Li and G. Miklau. Optimal error of query sets under the differentially-private matrix mechanism. In ICDT, 2013.
[17] A. Machanavajjhala, A. Korolova, and A. D. Sarma. Personalized social recommendations - accurate or private? In PVLDB, volume 4, pages 440–450, 2011.
[18] W. Qardaji, W. Yang, and N. Li. Understanding hierarchical methods for differentially private histograms. In PVLDB, 2013.
[19] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. In ICDE, pages 225–236, 2010.
[20] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In ICDE, pages 32–43, 2012.
APPENDIX

A. PROOF OF THEOREM 8.2

Proof. (sketch) Recall that the policy specific global sensitivity is defined as

    S(h, P) = max_{(D_1, D_2) ∈ N(P)} ||h(D_1) - h(D_2)||_1.

Direction I (S(h, P) ≤ 2 max{α(G_P), ξ(G_P)}). It suffices to prove that for any two databases D_1, D_2 ∈ I_Q, if |T(D_1, D_2)| > max{α(G_P), ξ(G_P)}, there must exist another database D_3 ∈ I_Q s.t. T(D_1, D_3) ⊊ T(D_1, D_2), i.e., (D_1, D_2) ∉ N(P); and thus for any two databases (D_1, D_2) ∈ N(P), we have ||h(D_1) - h(D_2)||_1 ≤ 2|T(D_1, D_2)| ≤ 2 max{α(G_P), ξ(G_P)}, which implies S(h, P) ≤ 2 max{α(G_P), ξ(G_P)}.

To complete the proof, we consider two databases D_1, D_2 ∈ I_Q with |T(D_1, D_2)| > max{α(G_P), ξ(G_P)}, and show how to construct the database D_3 defined above. First of all, for any secret pair (s^i_x, s^i_y) ∈ T(D_1, D_2), it must lift and/or lower some count query q_φ ∈ Q; otherwise, we can construct D_3 by changing the value of the tuple t with t.id = i in D_1 into its value in D_2.

To construct D_3, now consider a directed graph G_{D_1|D_2} = (V_{D_1|D_2}, E_{D_1|D_2}), where V_{D_1|D_2} ⊆ V_P and E_{D_1|D_2} is a multi-subset of E_P (i.e., an edge in E_P may appear multiple times in E_{D_1|D_2}). E_{D_1|D_2} is constructed as follows: for each (s^i_x, s^i_y) ∈ T(D_1, D_2), i) if (x, y) lifts q_{φ'} and lowers q_φ, add a directed edge (q_φ, q_{φ'}) to E_{D_1|D_2}; ii) if (x, y) lifts q_φ but does not lower any other q_{φ'}, add an edge (v+, q_φ); and iii) if (x, y) lowers q_φ but does not lift any other q_{φ'}, add an edge (q_φ, v-). V_{D_1|D_2} is the set of count queries involved in E_{D_1|D_2}.

G_{D_1|D_2} is Eulerian, i.e., each vertex has the same in-degree as out-degree, except v+ and v- (if they exist in G_{D_1|D_2}), because of the above construction and the fact that D_1, D_2 ∈ I_Q. As |E_{D_1|D_2}| = |T(D_1, D_2)| > max{α(G_P), ξ(G_P)} (i.e., G_{D_1|D_2} is larger than any simple cycle or simple v+-v- path in G_P) and G_{D_1|D_2} is Eulerian, G_{D_1|D_2} must have a proper subgraph which is either a simple cycle or a simple v+-v- path. Let E_{D_1→D_2} be the edge set of this simple cycle/path. Construct D_3 to be identical to D_1, except that for each secret pair (s^i_x, s^i_y) associated with an edge in E_{D_1→D_2}, the value of the tuple t with t.id = i is changed from x to y. We can show that D_3 satisfies its definition, which completes the proof of Direction I.

Direction II (S(h, P) ≥ 2 max{α(G_P), ξ(G_P)}). Let us first prove a weaker inequality:

    max_{(D_1, D_2) ∈ N(P)} |T(D_1, D_2)| ≥ max{α(G_P), ξ(G_P)}.    (17)

It implies S(h, P) ≥ 2 max{α(G_P), ξ(G_P)} if the condition in (ii) of the theorem holds; combined with Direction I, we can then conclude S(h, P) = 2 max{α(G_P), ξ(G_P)}.

To prove (17), it suffices to show that for any simple cycle or v+-v- path in G_P, we can construct two databases D_1 and D_2 s.t. (D_1, D_2) ∈ N(P) and |T(D_1, D_2)| equals its length. Consider a simple cycle q_{φ_1}, q_{φ_2}, ..., q_{φ_l}, q_{φ_{l+1}} = q_{φ_1}. Starting with any database D ∈ I_Q, let D_1 ← D and D_2 ← D initially. For each edge (q_{φ_i}, q_{φ_{i+1}}), from the definition of policy graphs, we can find a secret pair (x, y) ∈ E(G) s.t. (q_{φ_i}(x) ∧ ¬q_{φ_i}(y)) ∧ (¬q_{φ_{i+1}}(x) ∧ q_{φ_{i+1}}(y)), i.e., (x, y) lowers q_{φ_i} and lifts q_{φ_{i+1}}; create two new tuples t_1 and t_2 with t_1.id = t_2.id = i, t_1 = x, and t_2 = y; and then let D_1 ← D_1 ∪ {t_1} and D_2 ← D_2 ∪ {t_2}. It is not hard to verify that we finally obtain two databases D_1 and D_2 s.t. (D_1, D_2) ∈ N(P) and |T(D_1, D_2)| equals the cycle length. The proof is similar for a simple v+-v- path.

B. PROOF OF THEOREM 4.1

Proof. (sketch) Let M_{M_1,M_2} denote the mechanism that outputs the results of M_1 and M_2 sequentially. As M_1 satisfies (ε_1, P)-Blowfish privacy, for every pair of neighboring databases (D_a, D_b) ∈ N(P) and every result r_1 ∈ range(M_1), we have

    Pr[M_1(D_a) = r_1] ≤ e^{ε_1} Pr[M_1(D_b) = r_1]    (18)

The result of M_1 is output before the result of M_2, so r_1 becomes an additional input to M_2, together with the original dataset. As M_2 satisfies (ε_2, P)-Blowfish privacy, for every pair of neighboring databases (D_a, D_b) ∈ N(P) coupled with the same r_1, and for every result r_2 ∈ range(M_2), we have

    Pr[M_2(D_a, r_1) = r_2] ≤ e^{ε_2} Pr[M_2(D_b, r_1) = r_2]    (19)

Therefore, for every pair of neighboring databases (D_a, D_b) ∈ N(P) and every output sequence (r_1, r_2), we have

    Pr[M_{M_1,M_2}(D_a) = (r_1, r_2)]
      = Pr[M_1(D_a) = r_1] Pr[M_2(D_a, r_1) = r_2]
      ≤ e^{ε_1} Pr[M_1(D_b) = r_1] e^{ε_2} Pr[M_2(D_b, r_1) = r_2]
      = e^{ε_1 + ε_2} Pr[M_1(D_b) = r_1] Pr[M_2(D_b, r_1) = r_2]
      = e^{ε_1 + ε_2} Pr[M_{M_1,M_2}(D_b) = (r_1, r_2)]    (20)
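As a concrete companion to this argument, here is a minimal sketch (ours, not from the paper) of two Laplace-based Blowfish releases composed sequentially; by Theorem 4.1 the pair of outputs satisfies (ε_1 + ε_2, P)-Blowfish privacy.

```python
import math, random

def laplace(scale):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def release(hist, sensitivity, eps):
    """(eps, P)-Blowfish release of a histogram via the Laplace mechanism,
    with noise calibrated to the policy specific sensitivity S(h, P)."""
    return [c + laplace(sensitivity / eps) for c in hist]

hist = [10, 3, 7, 0]                        # histogram of some database D
r1 = release(hist, sensitivity=8, eps=0.4)  # mechanism M1
r2 = release(hist, sensitivity=8, eps=0.6)  # mechanism M2
# Theorem 4.1: publishing (r1, r2) is (0.4 + 0.6, P)-Blowfish private.
```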

C. PROOF OF THEOREMS 4.2-4.3

Proof. (sketch) For every pair of neighboring databases (D_a, D_b) ∈ N(P), under the cardinality constraint or under disjoint subsets of constraints Q_1, ..., Q_p, there is only one subset of ids, say S_{i*}, with different values in D_a and D_b, while D_a ∩ S_i = D_b ∩ S_i for all i ≠ i*. Hence, for every output sequence r,

    Pr[M(D_a) = r] = Π_i Pr[M_i(D_a ∩ S_i) = r_i]
      ≤ e^{ε_{i*}} Pr[M_{i*}(D_b ∩ S_{i*}) = r_{i*}] Π_{i ≠ i*} Pr[M_i(D_b ∩ S_i) = r_i]
      ≤ e^{max_i ε_i} Π_i Pr[M_i(D_b ∩ S_i) = r_i]
      = e^{max_i ε_i} Pr[M(D_b) = r]    (21)
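The disjointness in Theorems 4.2-4.3 also has a direct operational reading, sketched below (reusing release() from the previous example; the partitioning shown is hypothetical): mechanisms run on disjoint id subsets compose to max_i ε_i rather than Σ_i ε_i.

```python
def parallel_release(partitions, eps_list, sensitivity=2):
    """Run an independent Laplace release on each disjoint subset D ∩ S_i.
    A neighboring database differs in only one subset, so the overall
    guarantee is (max_i eps_i, P)-Blowfish privacy."""
    return [release(part, sensitivity, eps)
            for part, eps in zip(partitions, eps_list)]

parts = [[10, 3], [7, 0], [5, 5]]  # histograms of D ∩ S_1, D ∩ S_2, D ∩ S_3
out = parallel_release(parts, [0.2, 0.5, 0.3])
# Overall guarantee: (max(0.2, 0.5, 0.3), P) = (0.5, P)-Blowfish privacy.
```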