Djoin: Differentially Private Join Queries Over Distributed Databases

DJoin: Differentially Private Join Queries over Distributed Databases Arjun Narayan Andreas Haeberlen University of Pennsylvania University of Pennsylvania Abstract sult of others, it is possible to give a strong upper bound on how much an adversary could learn about an individ- In this paper, we study the problem of answering queries ual person’s data, even under worst-case assumptions. about private data that is spread across multiple different Several differentially private query processors, includ- databases. For instance, a medical researcher may want ing PINQ [23], Airavat [32], Fuzz [16], and PDDP [6], to study a possible correlation between travel patterns have been developed and are available today. and certain types of illnesses. The necessary informa- However, existing query processors assume either tion exists today – e.g., in airline reservation systems that all the data is available in a single database [16, 23, and hospital records – but it is maintained by two sepa- 32] or that distributed queries can be broken into sev- rate companies who are prevented by law from sharing eral subqueries that can each be answered using only this information with each other, or with a third party. one of the databases [6, 10, 15, 31]. In practice, this is This separation prevents the processing of such queries, not necessarily the case. For instance, suppose a medical even if the final answer, e.g., a correlation coefficient, researcher wanted to study how a certain illness is cor- would be safe to release. related with travel to a particular region. This data may We present DJoin, a system that can process such dis- be available, e.g., in a hospital database H and an airline tributed queries and can give strong differential privacy reservation system R, but to determine the correlation, guarantees on the result. DJoin can support many SQL- it is necessary to join the two databases together – for style queries, including joins of databases maintained by instance, we must count the individuals who have been different entities, as long as they can be expressed using treated for the illness (according to H) and have traveled DJoin’s two novel primitives: BN-PSI-CA, a differen- to the region (according to R). tially private form of private set intersection cardinal- We are not aware of any existing method or query ity, and DCR, a multi-party combination operator that processor that can efficiently support join queries with can aggregate noised cardinalities without compounding differential privacy guarantees. Joins cannot be bro- the individual noise terms. Our experimental evaluation ken into smaller subqueries on individual databases be- shows that DJoin can process realistic queries at prac- cause, in order to match up the same persons’ data tical timescales: simple queries on three databases with in the two databases, such queries would have to ask 15,000 rows each take between 1 and 7.5 hours. about individual rows, which is exactly what differential privacy is designed to prevent. In principle, one 1 Introduction could process joins using secure multi-party computa- tion (MPC) [38], but MPC is only practical for small A vast amount of information is constantly accumu- computational tasks, and differential privacy only works lating in databases (social networks, hospital records, well for large databases. The cost of an entire join under airline reservation systems, etc.) all around the world. MPC would be truly spectacular. There are many good uses to which this data could po- DJoin, the system we present in this paper, is a so- tentially be put; however, much of this data is sensitive lution to this problem. DJoin can support SQL-style and cannot safely be released because of privacy con- queries across multiple databases, including common cerns. Simple solutions, such as anonymizing or aggre- forms of joins. The key insight behind DJoin is that the gating the data before release, are not reliable; experi- distributed parts of many queries can be expressed as ence with cases like the Netflix prize [3] or the AOL intersections of sets or multisets. For instance, we can search data [2] shows that such data can sometimes be rewrite the query from above to locally select all patients de-anonymized with auxiliary information [26]. with the illness from H and all travelers to the relevant Differential privacy [7] has been proposed as a way to region from R, then intersect the resulting sets, and fi- solve this problem. By disallowing certain queries, and nally count the number of elements in the intersection. by adding a carefully chosen amount of noise to the re- Not all SQL queries can be rewritten in this way, but 1 USENIX Association 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12) 149 many counting queries can: conjunctions and disjunc- tacker can learn about any individual, even with access tions of equality tests directly correspond to unions and to auxiliary information. intersections of data elements. As we will show, a num- Differentially private query processors: PINQ [23], ber of additional operations, such as inequalities and nu- Airavat [32], and Fuzz [16] are query processors that meric comparisons, can be expressed in terms of multi- support differential privacy, but they assume a central- set operations. ized setting in which a single entity has access to the en- Protocols for private set operations have been stud- tire data. We are aware of five solutions for distributed ied by cryptographers for some time [14, 17, 37], but settings [6, 10, 15, 31, 33], but these assume that the data existing solutions compute exact set elements or exact is horizontally partitioned (i.e., each individual’s data is cardinalities, which is not compatible with differential completely contained in one of the databases), and that privacy. We present blinded, noised private set inter- the query can be factored into subqueries that are each section cardinality (BN-PSI-CA), an extension of the local to a single database. For instance, [10] computes set-intersection protocol from [17] that supports private queries of the form ∑i f (di), i.e., the sum over all rows i noising, as well as denoise-combine-renoise (DCR), an in the database after applying a function f to each row. operator that can add or subtract multiple noised sub- DJoin’s data model is more general: multiple databases set cardinalities without compounding the correspond- may contain data for a given individual, and queries ing noise terms. DCR relies on MPC to remove the noise can contain joins. We note that some of the other sys- terms on its inputs and to re-noise the output, but DCR’s tems have far more sophisticated query languages, but complexity grows with the number of parties and not we speculate that DJoin’s rewriting and execution en- with the number of elements in the sets. For the queries gine could be integrated with existing systems, e.g., with we tried, this step never took more than 20 seconds. PINQ or Fuzz. We have implemented and evaluated a prototype of Private set operations: The first protocols for private DJoin. Our results show that the costs are substantial but two-party set intersection and set intersection cardinal- typically feasible. For instance, the elements in a sim- ity were proposed by Freedman et al. [14]. Since then, ple two-way join on databases with 32,000 rows each a number of improvements have been proposed; for in- can be evaluated in about 1.8 hours, with 83 MB of stance, Kissner and Song [17] extended the protocols to traffic, using a single commodity workstation for each multiple parties, and Vaidya and Clifton [37] reduced database. This is orders of magnitude faster than gen- the computational overhead. These protocols produce eral MPC. DJoin’s cost is too high for interactive use, exact results, and are thus not directly suitable for dif- but it seems practical for applications that can tolerate a ferential privacy. There are specialized protocols for certain amount of latency, such as research studies. Our other private multi-party operations, e.g., for decision- algorithms are easy to parallelize, so the speed could be tree learning [29], and some of these have been adapted improved by increasing the number of cores. for differential privacy, e.g., [39]. To summarize, this paper makes the following four Computational differential privacy: The standard def- contributions: inition of differential privacy is information-theoretic, i.e., it holds even against a computationally unbounded two new primitives, BN-PSI-CA and DCR, for dis- • adversary. In contrast, DJoin provides computational tributed private query processing (Section 4); differential privacy [25]: it relies on a homomorphic a query planner that rewrites SQL-style queries to • cryptosystem and thus depends on certain computa- take advantage of those two primitives (Section 5); tional hardness assumptions. Mironov et al. [25] demon- the design of DJoin, an engine for distributed, dif- strated a protocol for this model that privately approxi- • ferentially private queries (Section 6); and mates the Hamming distance between two vectors in a an experimental evaluation of DJoin, based on a two-party setting. This problem is closely related to that • prototype implementation (Section 7). of computing the cardinality of set intersections, which is solved by BN-PSI-CA. Untrusted servers: Several existing systems enable 2 Related work clients to use an untrusted server without exposing DJoin provides differential privacy [7, 8, 9, 11], which private information to that server. In SUNDR [19], is one of the strongest privacy guarantees that have been SPORC [13], and Depot [22], the server provides stor- proposed so far. Alternatives include randomization [1], age; in CryptDB [30], it implements a database and k-anonymity [34], and l-diversity [21], which are gener- SQL-style queries.

Load more