Shrinkwrap: Efficient SQL Query Processing in Differentially Private Data Federations

Johes Bater (Northwestern University), Xi He (University of Waterloo), William Ehrich (Northwestern University), Ashwin Machanavajjhala (Duke University), Jennie Rogers (Northwestern University)

ABSTRACT

A private data federation is a set of autonomous databases that share a unified query interface offering in-situ evaluation of SQL queries over the union of the sensitive data of its members. Owing to privacy concerns, these systems do not have a trusted data collector that can see all their data, and their member databases cannot learn about individual records of other engines. Federations currently achieve this goal by evaluating queries obliviously using secure multiparty computation. This hides the intermediate result cardinality of each query operator by exhaustively padding it. With cascades of such operators, this padding accumulates to a blow-up in the output size of each operator and a proportional loss in query performance. Hence, existing private data federations do not scale well to complex SQL queries over large datasets.

We introduce Shrinkwrap, a private data federation that offers data owners a differentially private view of the data held by others to improve performance over oblivious query processing. Shrinkwrap uses computational differential privacy to minimize the padding of intermediate query results, achieving up to a 35X performance improvement over oblivious query processing. When the query needs differentially private output, Shrinkwrap provides a trade-off between result accuracy and query evaluation performance.

PVLDB Reference Format:
Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, Jennie Rogers. Shrinkwrap: Differentially-Private Query Processing in Private Data Federations. PVLDB, 12(3): 307-320, 2018.
DOI: https://doi.org/10.14778/3291264.3291274

1. INTRODUCTION

The storage and analysis of data has seen dramatic growth in recent years. Organizations have never valued data more highly. Unfortunately, this value has attracted unwanted attention. Data breaches litter the news headlines, creating fear and a hesitance to share data, even among trusted collaborators. Without data sharing, information becomes siloed and enormous potential analytical value is lost.

Recent work in databases and cryptography attempts to solve the data sharing problem by introducing the private data federation [4]. A private data federation consists of a set of data owners who support a common relational schema. Each party holds a horizontal partition (i.e., a subset of rows) of each table in the database. A private data federation provides a SQL query interface to analysts (clients) over the union of the records held by the data owners. Query evaluation is performed securely across multiple data owners without revealing unauthorized information to any party involved in the query and without the assistance of a trusted data curator.

A private data federation must provably ensure the following guarantee: a data owner should not be able to reconstruct the database (or a part of it) held by other data owners based on the intermediate result sizes of the query evaluation. One way to achieve this without a trusted data curator is to use secure computation protocols, which provide a strong privacy guarantee that intermediate results leak no information about their inputs.

Current implementations of private data federations that use secure computation have a performance problem: state-of-the-art systems have slowdowns of 4–6 orders of magnitude over non-private data federations [4, 56]. This slowdown comes from the secure computation protocols used by private data federations to securely evaluate SQL queries without the need of a trusted third party. A query is executed as a directed acyclic graph of database operators. Each operator takes one or two inputs, applies a function, and outputs its result to the next operator. In a typical (distributed) database engine, the execution time needed to compute each intermediate result and the size of that result leak information about the underlying data to curious data owners participating in the secure computation. To prevent this leakage, private data federations insert dummy tuples to pad intermediate results to their maximum possible size, thereby ensuring that execution time is independent of the input data. With secure evaluation, query execution always runs in the worst case, drastically increasing computation costs as intermediate result sizes grow. While performance is reasonable for simple queries and small workloads, it is untenable for complex SQL queries with multiple joins.

Several approaches attempt to solve this performance problem, but they fail to provably bound the information leaked to a data owner. One line of research uses Trusted Execution Environments (TEEs) that evaluate relational operators within an on-chip secure hardware enclave [54, 56]. TEEs efficiently protect query execution, but they require specialized hardware from chip manufacturers. Moreover, current TEE implementations from both Intel and AMD do not adequately obscure computation, allowing observers to obtain supposedly secure data through widely publicized hardware vulnerabilities [9, 17, 28, 31, 49]. Another approach selectively applies homomorphic encryption to evaluate relational operators in query trees, while keeping the underlying tuples encrypted throughout the computation (e.g., CryptDB [45]). While this system improves performance, it leaks too much information about the data, such as statistics and memory access patterns, allowing a curious observer to deduce information about the true values of encrypted tuples [15, 26, 39, 7].

In this work, we bridge the gap between provably secure systems with untenable performance and practical systems with no provable guarantees on leakage using differential privacy [14]. Differential privacy is a state-of-the-art technique to ensure privacy, and it provides a provable guarantee that one cannot reconstruct records in a database based on the outputs of a differentially private algorithm. Differentially private algorithms, nevertheless, permit approximate aggregate statistics about the dataset to be revealed.

We present Shrinkwrap, a system that improves private data federation performance by carefully relaxing the privacy guaranteed to data owners in terms of differential privacy. Instead of exhaustively padding intermediate results to their worst-case sizes, Shrinkwrap obliviously eliminates dummy tuples according to tunable privacy parameters, reducing each intermediate cardinality to a new, differentially private value. The differentially private intermediate result sizes are close to the true sizes, and thus Shrinkwrap achieves practical query performance.

To the best of our knowledge, Shrinkwrap is the first system for private data sharing that combines differential privacy with secure computation for query performance optimization. The main technical contributions in this work are:
• A query processing engine that offers controlled information leakage with differential privacy guarantees to speed up private data federation query processing and provides tunable privacy parameters to control the trade-off between privacy, accuracy, and performance
• A computational differential privacy mechanism that securely executes relational operators while minimizing intermediate result padding of operator outputs
• A novel algorithm that optimally allocates, tracks, and applies differential privacy across query execution
• A protocol-agnostic cost model that approximates the large, non-linear computation overhead of secure computation as it cascades up an operator tree

The rest of this paper is organized as follows. In Section 2 we define private data federations, outline our privacy goals, and formally define secure computation and differential privacy. Section 3 describes the problem addressed by Shrinkwrap. Our end-to-end solution, Shrinkwrap, and its privacy guarantees are described in Section 4. We show how to optimize the performance of Shrinkwrap in Section 5. Section 6 describes how we implement specific secure computation protocols on top of Shrinkwrap's protocol-agnostic design. We experimentally evaluate our system implementation over real-world medical data in Section 7. We conclude with a survey of related work and future directions.

2. PRIVATE DATA FEDERATION

In this section, we formally define a private data federation (PDF), describe privacy goals and assumptions, and define two security primitives – secure computation and differential privacy.

A private data federation is a collection of autonomous databases that share a unified query interface offering in-situ evaluation of SQL queries over the union of the sensitive data of its members without revealing unauthorized information to any party involved in the query. A private data federation has three types of parties: 1) two or more data owners DO1, ..., DOm that hold private data D1, ..., Dm respectively, where all Di share a public schema of k relations (R1, ..., Rk); 2) a query coordinator that plans and orchestrates SQL queries q on the data of the data owners; and 3) a client that writes SQL statements to the query coordinator. The set of private data D̄ = (D1, ..., Dm) owned by the data owners is a horizontal partition of every table in the total data set D.

[Figure 1: Private data federation architecture. The end user submits SQL query q to the query coordinator, which compiles a secure query plan for the private data owners; the data owners evaluate it with secure computation and return secret shares of q's output (one per data owner), which the coordinator assembles into the SQL output of q over all databases.]

As shown in Figure 1, the client first passes a SQL statement q(·) to the query coordinator, who compiles the query into an optimized secure query plan to be executed by each data owner. Once the data owners finish execution, they each pass their secret share of the output to the query coordinator, who assembles and returns the result to the client.

2.1 Privacy Goals and Assumptions

Table 1 summarizes the privacy goals for all parties involved in this process and the necessary assumptions required for achieving these goals.

Privacy Goals: Data owners require a privacy guarantee that the other parties (data owners, query coordinator, and the client) are not able to reconstruct the private data they hold based on intermediate results of the query execution. In particular, we require data owners to have a differentially private view over the inputs of other data owners (formally defined in Def. 2). We aim to support two policies for clients. Clients may be trusted (output policy 1) and allowed to see true answers to queries, but they must not learn any other information about the private inputs held by data owners. Clients may be untrusted (output policy 2), in which case they are only permitted a differentially private view over the inputs. The query coordinator is an extension of the client and has the same privacy requirements.

Table 1: Privacy Goals and Assumptions

                                Client Sees True Answers        Client Sees Noisy Answers
Privacy Goal for DOs            Differentially private (DP) view over data held by other DOs
Privacy Goal for Client         Learns only the query answer    DP view over data held by DOs
Client & DO Assumption          Semi-honest & computationally bounded
DOs Collude with Client         Not allowed                     Allowed
Client Colludes with some DOs   Not allowed                     Allowed

Assumptions: To achieve the privacy goals presented above, a private data federation assumes that all parties are computationally bounded and work in the semi-honest setting. Hence, the query coordinator honestly follows the protocols and creates a secure plan to evaluate the query. All parties faithfully execute the cryptographic protocol created by the query coordinator, but they may attempt to infer properties of private inputs held by other parties by observing the query instruction traces and data access patterns. The query coordinator is also assumed to be memory-less, as it destroys its contents after sending query results back to the client. When the client sees the true answers (output policy 1), we assume that the client is trusted not to collude with data owners; otherwise, if the client shares the true answer with a data owner (or publishes the true answer), the data owner who colludes with the client can use their own private data and the true query answer to infer the input of the other data owners. Similarly, we assume the data owners do not share their true input with the client; otherwise, the client can infer the other data owners' input and gain more information than just the query answer. When the client is not trusted, we assume that the client may collude with (as many as all but one) data owners. Our guarantees hold even in that extreme case.

2.2 Security Primitives

Secure Computation: To securely compute functions on distributed data owned by multiple parties, secure computation primitives are required. Let f : D^m → R^m be a functionality over D̄ = (D1, ..., Dm) ∈ D^m, where f_i(D̄) denotes the i-th element of f(D̄). (Functionality f can be either deterministic or randomized.) Let Π be an m-party protocol for computing f, and let VIEW_I^Π(D̄) be the view of the i-th party during an execution of Π on D̄. For I = {i_1, ..., i_t} ⊆ [m], we let D̄_I = (D_{i_1}, ..., D_{i_t}), f_I(D̄) = (f_{i_1}(D̄), ..., f_{i_t}(D̄)), and VIEW_I^Π(D̄) = (I, VIEW_{i_1}^Π(D̄), ..., VIEW_{i_t}^Π(D̄)). We define secure computation with respect to semi-honest behavior [19] as follows.

Definition 1 (Secure Computation). We say an m-party protocol Π securely computes f : D^m → R^m if there exists a probabilistic polynomial-time algorithm S and a function negl(κ) such that for any non-uniform probabilistic polynomial adversary A, every I ⊆ [m], every polynomial p(·), every sufficiently large κ ∈ N, and every D̄ ∈ D^m of size at most p(κ), it holds that

| Pr[A(S(I, D̄_I, f_I(D̄)), f(D̄)) = 1] − Pr[A(VIEW_I^Π(D̄), OUTPUT^Π(D̄)) = 1] | ≤ negl(κ),    (1)

where negl(κ) refers to any function that is o(κ^{-c}) for all constants c, and OUTPUT^Π(D̄) denotes the output sequence of all parties during the execution represented in VIEW^Π(D̄).

Various cryptographic protocols that enable secure computation have been studied [5, 10, 11, 18, 46, 33, 52, 55], and several are applicable to relational operators [4, 8, 50].

Differential Privacy (DP): Differential privacy [14] is an appealing choice to bound the information leakage on individual records in the input while allowing multiple releases of statistics about the data. This privacy notion is used by numerous organizations, including the US Census Bureau [34], Google [16], and Uber [27]. Formally, differential privacy is defined as follows.

Definition 2 ((ε, δ)-Differential Privacy). A randomized mechanism f : D → R satisfies (ε, δ)-differential privacy if for any pair of neighboring databases D, D′ ∈ D such that D and D′ differ by adding or removing a row, and any set O ⊆ R,

Pr[f(D) ∈ O] ≤ e^ε · Pr[f(D′) ∈ O] + δ.    (2)

The differential privacy guarantee degrades gracefully when invoked multiple times. In the simplest form, the overall privacy loss of multiple differentially private mechanisms can be bounded with sequential composition [13].

Theorem 1 (Sequential Composition). If M1 and M2 are (ε1, δ1)-DP and (ε2, δ2)-DP algorithms that use independent randomness, then releasing the outputs M1(D) and M2(D) on database D satisfies (ε1 + ε2, δ1 + δ2)-DP.

There exist advanced composition theorems that give tighter bounds on privacy loss under certain conditions [14], but we use sequential composition as defined above because it is general and easy to compute.

3. PROBLEM STATEMENT

Our goal is to build a system that permits efficient query evaluation on private data federations while achieving provable guarantees on the information disclosed to clients and data owners. As discussed in Section 1, prior work is divided into two classes: (i) systems that use secure computation to answer queries, which give formal privacy guarantees but suffer 4–6 orders of magnitude slowdowns compared to non-private data federations [4, 56], and (ii) systems that use secure hardware [54, 56] or property-preserving encryption [45], which are practical in terms of performance but have no formal guarantee on the privacy afforded to clients and data owners.

While secure computation based solutions are attractive, and there is much recent work on improving their efficiency [6, 24, 25], secure computation fundamentally limits performance due to several factors: 1) a worst-case running time is necessary to avoid leaking information from early termination; 2) computation must be oblivious, i.e., the program counters and the memory traces must be independent of the input; 3) cryptographic keys must be sent and computed on by data owners, the cost of which scales with the complexity of the program.

In particular, applying secure computation protocols to execute SQL queries results in extremely slow performance. A SQL query is a directed acyclic graph of database operators. Each operator in the tree takes an input, applies a function, and outputs its result to the next operator. The execution time needed to calculate each intermediate result and the array size required to hold that result leak information about the underlying data. To address this leakage, private data federations must insert dummy tuples to pad intermediate results to their maximum possible size and ensure that execution time is independent of the input data. We demonstrate this property in the following example.

[Figure 2: Exhaustive padding of intermediate results in an oblivious query evaluation. The example query SELECT DISTINCT pid FROM demographic de, diagnosis di, medication m WHERE di.diag = 'hd' AND m.med = 'aspirin' AND di.code = m.code AND de.pid = m.pid is executed as an operator tree: filters σ_diag=hd and σ_med=aspirin over diagnosis [n] and medication [n] feed ⋈_code [n²], which joins with demographic [n] at ⋈_pid [n³], followed by DISTINCT [n³].]

Example 1. Figure 2 shows a SQL query and its corresponding operator tree. Tuples flow from the source relations through a filter operator, before entering two successive join operators and ending at a distinct operator. To hide the selectivity of each operator, the private data federation pads each intermediate result to its maximum value. At the filters, the results are padded as if no tuples were filtered out, remaining at size n. This is a significant source of additional computation. If the selectivity of a filter is 10⁻³, our padding gives a 1000× performance overhead. As the tuples flow into the first join, which now receives two inputs of size n, exhaustive padding produces an output of n² tuples. When this result passes through the next join, the result size that the distinct operator must process jumps to n³. If we once again assume a 10⁻³ selectivity at the joins, we now see a 10⁹× computation overhead. Such a significant slowdown arises from the cumulative effect of protecting cascading operators.

By exhaustively padding operator outputs, we see ever-increasing intermediate result cardinalities and, as a function of that growth, ever-decreasing performance.
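To make the cascade in Example 1 concrete, the short Python sketch below tabulates the exhaustively padded sizes against the overhead factors quoted in the example. It is a back-of-the-envelope illustration only: the base-table size n, the single 10⁻³ selectivity per filter and join, and the function name are our assumptions, not part of Shrinkwrap.

def example1_blowup(n: int, selectivity: float = 1e-3):
    # Exhaustively padded intermediate sizes for the Example 1 operator tree.
    padded = {
        "filter": n,          # filters padded as if nothing was removed
        "first_join": n ** 2, # two size-n inputs -> n^2 padded output
        "second_join": n ** 3 # feeds the DISTINCT operator
    }
    filter_overhead = 1.0 / selectivity           # ~1,000x at a filter
    cascade_overhead = (1.0 / selectivity) ** 3   # ~10^9x by the DISTINCT
    return padded, filter_overhead, cascade_overhead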
To tackle this fundamental performance challenge, we formalize our research problems as follows.
• Build a system that uses differentially private leakage of intermediate results to speed up SQL query evaluation while achieving all the privacy goals stated in Section 2.
• Given an SQL operator o and a privacy budget (ε, δ), design an (ε, δ)-differentially private mechanism that executes the operator with a smaller performance overhead compared to fully oblivious computation.
• Given a sequence of operators Q = {o1, . . . , ol}, design an algorithm to optimally split the privacy budget (ε, δ) among these operators during query execution.

In the following sections, we present our system for efficient SQL query processing, starting with the differentially private mechanisms and overall privacy analysis used for operator evaluation (Section 4), and then our budget splitting algorithm that optimizes performance (Section 5).

4. SHRINKWRAP

In this section, we present the end-to-end algorithm of Shrinkwrap for answering a SQL query in a private data federation, the key differentially private mechanism for safely revealing the intermediate cardinality of each operator, and the overall privacy analysis of Shrinkwrap.

4.1 End-to-end Algorithm

At its core, Shrinkwrap is a system that applies differential privacy throughout SQL query execution to reduce intermediate result sizes and improve performance. The private data federation ingests a query q as a directed acyclic graph (DAG) of database operators Q = {o1, . . . , ol}, following the order of a bottom-up tree traversal. The query coordinator decides the output policy based on the client type and assigns the privacy parameters (ε, δ) for the corresponding privacy goals. To improve performance, a privacy budget of (ε1→l, δ1→l) will be spent on protecting the intermediate cardinalities of the l operators. When the client is allowed to learn the true query output (output policy 1), then ε1→l = ε and δ1→l = δ; when the client is allowed to learn only noisy query answers from a differentially private mechanism, then ε1→l < ε and δ1→l < δ, and the remaining privacy budget (ε − ε1→l, δ − δ1→l) will be spent on the query output.

Algorithm 1 outlines the end-to-end execution of a SQL query in Shrinkwrap. Inputs to the algorithm are the query Q, the overall privacy parameters (ε, δ), the privacy budget for improving performance (ε1→l, δ1→l), and the public information K, including the relational database schema and the maximum possible input data size at each party. First, the query coordinator allocates the privacy budget for improving performance (ε1→l, δ1→l) among the operators in Q such that

Σ_{i=1}^{l} εi = ε1→l  and  Σ_{i=1}^{l} δi = δ1→l,    (3)

where (εi, δi) is the privacy budget allocated to operator oi. If εi = δi = 0, then the operator is evaluated obliviously (without revealing the intermediate result size). When εi, δi > 0, Shrinkwrap is allowed to reveal an overestimate of the intermediate result size, with larger (εi, δi) values giving tighter intermediate result estimates. The performance improvement of Shrinkwrap highly depends on how the budget is split among the operators, but any budget split that satisfies Equation 3 satisfies privacy. We show several allocation strategies in Section 5 and analyze them in the evaluation.

Next, the query coordinator compiles a secure query plan to be executed by the data owners. The secure query plan traverses the operators in Q. For each operator oi, the data owners compute secret shares of the true output using secure computation over the federated database and the outputs of other operators computed from D. We denote this secure computation for oi by SMC(oi, D). The true outputs (of size ci) are placed into a secure array Oi. The secure array is padded (with encrypted dummy values) so that its size equals the maximum possible output size for operator oi. Then, Shrinkwrap calls an (εi, δi)-differentially private function Resize() to resize this secure array Oi to a smaller array Si such that a subset of the (encrypted) dummy records are removed. Resizing is described in Section 4.2.

Once all the operators are securely evaluated, if the output policy allows release of true answers to the client (policy 1), the last secure array Sl is directly returned to the query coordinator; otherwise, the data owners jointly and securely compute a differentially private mechanism on Sl with the remaining privacy budget (ε′ = ε − ε1→l, δ′ = δ − δ1→l), denoted by M^(ε′,δ′)(Sl), and return the output of M^(ε′,δ′)(Sl) to the query coordinator. For instance, one could use a Laplace mechanism to perturb count queries, using a multiparty protocol for generating noise as in [38]. Finally, the query coordinator assembles the final secure array received from the data owners, decrypting the final result R and returning it to the client.

Algorithm 1: End-to-end Shrinkwrap algorithm
  Input: PDF query DAG with operators Q = {o1, . . . , ol}, the federated database D, public information K, privacy parameters (ε, δ), privacy budget for performance (ε1→l ≤ ε, δ1→l ≤ δ)
  Output: Query output R
  // Computed on the query coordinator
  P = {(ε1, δ1), . . . , (εl, δl) | Σ εi = ε1→l, Σ δi = δ1→l} ← AssignBudget(ε1→l, δ1→l, Q, K)   // (Sec. 5)
  // Computed on the private data owners
  for i ← 1 to l do
      (Oi, ci) ← SMC(oi, D)              // with exhaustive padding
      Si ← Resize(Oi, ci, εi, δi, ∆ci)   // DP resizing (Sec. 4.2)
  end
  if ε = ε1→l and δ = δ1→l then
      Send the query coordinator Sl                     // output policy 1
  else
      Query output budget ε′ ← ε − ε1→l, δ′ ← δ − δ1→l
      Send the query coordinator M^(ε′,δ′)(Sl)          // output policy 2
  end
  // Computed on the query coordinator
  R ← Assemble(Sl)
  return R to the client

  function Resize(O, c, ε, δ, ∆c)
      c̃ ← c + TLap(ε, δ, ∆c)   // perturb cardinality
      O ← Sort(O)               // obliviously sort dummy tuples to the end
      S ← new SecureArray(O[1, . . . , c̃])
      return S

4.2 DP Resizing Mechanism

For each operator oi, Shrinkwrap first computes the true result using secure computation and places it into an exhaustively padded secure array Oi. For example, as shown in Figure 3, a join operator with two inputs, each of size n, will have a secure array Oi of size n². This is the exhaustive padding case shown on the left side of the figure, with the private data federation inserting the materialized intermediate results into the secure array and padding those results with dummy tuples. The Resize() function takes in the exhaustively padded secure array Oi, the true cardinality of the output ci, the privacy parameters (εi, δi), and the sensitivity of the cardinality query at operator oi, denoted by ∆ci. The cardinality query ci(D) returns the output size of operator oi in the PDF query q. The sensitivity of a query is defined as follows.

[Figure 3: Effect of Shrinkwrap on intermediate result sizes when joining tables R and S. Without Shrinkwrap, the secure array is exhaustively padded to n² = |R| · |S| (real data plus dummy data); with Shrinkwrap, it holds the true cardinality |R ⋈ S| plus a differentially private amount of dummy data determined by Lap(ε, δ, ∆c) noise.]

Definition 3 (Query Sensitivity). The sensitivity of a function f : D → R is the maximum difference in its output for any pair of neighboring databases D and D′:

∆f = max_{D, D′ s.t. |(D−D′)∪(D′−D)| = 1} ||f(D) − f(D′)||₁.

Based on these inputs, the Resize() function runs an (εi, δi)-differentially private truncated Laplace mechanism (Def. 4) to obtain a new, differentially private cardinality c̃i > ci for the secure array. Next, this function sorts the exhaustively padded secure array Oi obliviously, such that all dummy tuples are at the end of the array, and then copies the first c̃i tuples of the sorted array into a new secure array Si of size c̃i. This new secure array Si, of smaller size than Oi, is used for the following computations.

Noise Generation: Shrinkwrap generates differentially private cardinalities using a truncated Laplace mechanism. Given an operator o with a privacy budget (ε, δ) and the sensitivity parameter ∆c of its cardinality query, this mechanism computes a noisy cardinality for use in Algorithm 1.

Definition 4 (Truncated Laplace Mechanism). Given a query c : D → N, the truncated Laplace mechanism TLap(ε, δ, ∆c) adds a non-negative integer noise max(η, 0) to the true query answer c(D), where η ∈ Z follows a distribution, denoted by L(ε, δ, ∆c), with probability mass function Pr[η = x] = p · e^{−(ε/∆c)·|x − η₀|}, where p = (e^{ε/∆c} − 1)/(e^{ε/∆c} + 1) and η₀ = −∆c · ln((e^{ε/∆c} + 1)·δ)/ε + ∆c.

This mechanism allows us to add a non-negative integer amount of noise to the intermediate cardinality query, which corresponds to the number of dummy records. The noise η drawn from L(ε, δ, ∆c) satisfies Pr[η < ∆c] ≤ δ, which enables us to show the privacy guarantee of this mechanism.
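For concreteness, the following Python sketch samples the noise of Definition 4 and produces the resized cardinality c̃ = c + max(η, 0). It is illustrative only: rounding the shift η₀ to the nearest integer and drawing the discrete Laplace noise as a difference of two geometric variables are our simplifications, and the function names are hypothetical rather than Shrinkwrap's API.

import math
import random

def truncated_laplace_noise(eps: float, delta: float, sensitivity: int) -> int:
    # Shift eta_0 from Definition 4; it ensures Pr[noise < sensitivity] <= delta.
    alpha = eps / sensitivity
    eta0 = -sensitivity * math.log((math.exp(alpha) + 1.0) * delta) / eps + sensitivity
    # Discrete Laplace with decay e^{-alpha}, sampled as a difference of geometrics.
    q = math.exp(-alpha)
    def geometric() -> int:
        return int(math.floor(math.log(1.0 - random.random()) / math.log(q)))
    eta = round(eta0) + geometric() - geometric()
    return max(eta, 0)

def noisy_cardinality(true_card: int, eps: float, delta: float, sensitivity: int) -> int:
    # The value used to size the smaller secure array S_i in Resize().
    return true_card + truncated_laplace_noise(eps, delta, sensitivity)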

Theorem 2. The truncated Laplace mechanism satisfies (ε, δ)-differential privacy.

Proof. For any neighboring databases D1, D2, let c1 = c(D1) ≥ 0 and c2 = c(D2) ≥ 0. W.l.o.g. we consider c2 ≥ c1. It is easy to see that Pr[TLap(D1) ∈ (−∞, c1)] = 0 and Pr[TLap(D2) ∈ (−∞, c2)] = 0. By the noise property, for any o ≥ c2 ≥ c1 it is true that | ln( Pr[TLap(D2) = o] / Pr[TLap(D1) = o] ) | ≤ ε. However, when the output o ∈ [c1, c2), Pr[TLap(D2) = o] = 0 but Pr[TLap(D1) = o] > 0, making the ratio of probabilities unbounded. Nevertheless, we can bound Pr[TLap(D1) ∈ [c1, c2)] by δ as shown below. Let O* = (−∞, c2). We can show that for any output set O of the query c(·), we have

Pr[TLap(D1) ∈ O]
  = Pr[TLap(D1) ∈ (O ∩ O*)] + Pr[TLap(D1) ∈ (O − O*)]
  ≤ Pr[TLap(D1) ∈ [c1, c2)] + e^ε · Pr[TLap(D2) ∈ (O − O*)]
  ≤ Pr[η1 < ∆c] + e^ε · Pr[TLap(D2) ∈ O]
  ≤ δ + e^ε · Pr[TLap(D2) ∈ O].    (4)

Hence, this mechanism satisfies (ε, δ)-DP.

Sensitivity Computation: In order to create our noisy cardinalities using the truncated Laplace mechanism in Definition 4, we must determine the cardinality query sensitivity at the output of each operator in the query tree. An intermediate result cardinality, ci, is the output size of the sub-query tree rooted at operator oi. The sensitivity of ci depends on the operators in its sub-query tree, and we compute it as a function of the stability of the operators in the sub-query tree. The stability of an operator bounds the change in its output with respect to a change in its input.

Definition 5 (Stability [36]). A transformation operator T is s-stable if for any two input data sets A and B, |T(A) ⊖ T(B)| ≤ s × |A ⊖ B|, where ⊖ denotes the symmetric difference.

For unary operators, such as SELECT and PROJECT, the stability usually equals one. For JOIN, a binary operator, the stability is equal to the maximum multiplicity of the input data. The sensitivity of the intermediate cardinality query ci can be recursively computed bottom up, starting from a single change in the input of the leaf nodes.

Example 2. Given the query tree shown in Figure 4, the bottom two filter operators each have a stability of one, giving a sensitivity (∆c) of 1 at that point. Next, the first join has a stability of m, the maximum input multiplicity, which increases the overall sensitivity to m. The following join operator also has a stability of m, which raises the sensitivity to m². Finally, the DISTINCT operator has a stability of 1, so the overall sensitivity remains m².

[Figure 4: Sensitivity analysis for the running example. Filters σ_diag=hd over diagnosis [∆c1 = 1] and σ_med=aspirin over medication [∆c2 = 1] feed ⋈_pid [∆c3 = m], which joins with demographics at ⋈_pid [∆c4 = m²], followed by DISTINCT [∆c5 = m²].]

Using these principles, Shrinkwrap calculates the sensitivity for each operator in a query tree for use in Laplace noise generation during query execution. More details on computing the sensitivity of SQL queries can be found in PINQ [36].

Secure Array Operations: Each time Shrinkwrap adds or removes tuples, it normally needs to pay an I/O cost for accessing the secure array. We avoid paying the full cost by carrying out a bulk unload and load. To start the Resize() function, Shrinkwrap takes an intermediate result secure array and sorts it so that all dummy tuples are at the end of the secure array. Next, Shrinkwrap cuts off the end of the original secure array and copies the rest into the new, differentially-private-sized secure array. Since the new secure array has a size guaranteed to be larger than the true cardinality, we know that cutting off the end will only remove dummy tuples. Thus, we avoid paying the full I/O cost for each tuple. However, we do pay an O(n log n) cost for the initial sorting, as well as an O(n) cost for bulk copying the tuples. This algorithm is an extension of the one in [21], and variants of it appear in [40, 41, 56].

4.3 Overall Privacy Analysis

Given a PDF query DAG q with operators {o1, . . . , ol} and a privacy parameter (ε, δ), Algorithm 1 achieves the privacy goals stated in Section 2.

Theorem 3. Under the assumptions in Section 2, Algorithm 1 achieves the privacy goals: data owners have an (ε, δ)-computational differentially private view over the input data of other data owners; when true answers are returned to the client, the client learns nothing else; when noisy answers are returned to the client, the client has an (ε, δ)-computational differentially private view over the input data of all the data owners.

Proof. (Sketch) We prove that the view of data owners satisfies computational differential privacy [37, 48] by showing that the view of any data owner is computationally indistinguishable from its view in an ideal simulation that outputs only the differentially private cardinalities of each operator output.

Let Sim be a simulator that takes as input the horizontal partitions held by the data owners (D1, ..., Dm) and outputs a vector S⊥i for each operator oi such that: (a) |S⊥i| is drawn from ci + TLap(εi, δi, ∆ci), and (b) S⊥i contains encryptions of 0 (of the appropriate arity). It is easy to see that the function computed by Sim satisfies (ε1→l, δ1→l)-DP: (a) at each operator, the release of |S⊥i| satisfies (εi, δi)-DP, and (b) by sequential composition (Theorem 1) and Equation 3, releasing all the cardinalities satisfies (ε1→l, δ1→l)-DP. When the output itself is differentially private, there is an additional privacy loss to data owners (since the client could release the output to the data owners), but the overall algorithm still satisfies (ε, δ)-DP by sequential composition.

The real execution outputs secure arrays Si at each operator oi using secure computation. This ensures that the view of a data owner in the real execution is computationally indistinguishable from the execution of Sim, which satisfies (ε, δ)-DP. Thus, the view of any data owner satisfies (ε, δ)-computational DP. The privacy guaranteed to the client and query coordinator can be argued analogously.

4.4 Multiple SQL Queries

The privacy loss of Shrinkwrap due to answering multiple queries is analyzed using sequential composition (Theorem 1). There are two approaches to minimize privacy loss across multiple queries: (i) applying advanced composition theorems that give tighter bounds on privacy loss under certain conditions [14], or (ii) optimizing privacy budget allocation across the operators of a workload of predefined SQL queries (e.g., using [35]). These two approaches are orthogonal to our work, and adapting these ideas to the context of secure computation is an interesting avenue for future work.
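To make the bottom-up sensitivity computation of Section 4.2 (Definition 5 and Example 2) concrete, the sketch below multiplies operator stabilities along the path from a changed leaf. It is a simplified illustration: the operator names, the list-based plan encoding, and the assumption that the maximum join multiplicity m is known are ours, not part of Shrinkwrap's interface.

def stability(op: str, max_multiplicity: int = 1) -> int:
    # Unary operators (SELECT, PROJECT, DISTINCT) are 1-stable; JOIN is
    # m-stable, where m is the maximum multiplicity of its input data.
    return max_multiplicity if op == "JOIN" else 1

def cardinality_sensitivity(path) -> int:
    # Multiply stabilities from the leaf up to the operator of interest,
    # starting from a single-row change at the leaf (Example 2).
    s = 1
    for op, m in path:
        s *= stability(op, m)
    return s

# Example 2: filter (1) -> join (m) -> join (m) -> distinct (1) gives m^2.
m = 4  # hypothetical maximum multiplicity
assert cardinality_sensitivity([("FILTER", 1), ("JOIN", m), ("JOIN", m), ("DISTINCT", 1)]) == m * m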

5. PERFORMANCE OPTIMIZATION

A key technical challenge is to divide the privacy budget for performance (ε1→l, δ1→l) across the different operators so that the overall runtime of the query execution is minimized.

Problem 1. Given a privacy budget (ε1→l, δ1→l), a PDF query DAG q with operators Q = {o1, . . . , ol}, and public information K, which includes the sensitivities of the intermediate cardinality queries {∆c1, . . . , ∆cl} and the schema information, compute the share (εi, δi) assigned to oi such that: (a) the assignment ensures privacy, i.e., Σ_{i=1}^{l} εi = ε1→l, and (b) the overall runtime of the query is minimized.

5.1 Baseline Privacy Budget Allocation

We first consider two baseline strategies for allocating the privacy budget to individual operators.

Eager: This approach allocates the entire budget (ε1→l, δ1→l) to the first operator in the query tree. In Figure 2, this equates to minimizing the output cardinality of the filter on diagnoses and executing the rest of the tree obliviously. A benefit of this approach is that since the intermediate results of the first operator flow through the rest of the tree, every operator will see the benefit of minimizing the output cardinality. On the other hand, in some trees, certain operators, like joins, have an outsized effect on the total query performance. Not allocating any privacy budget to those operators results in a missed optimization opportunity.

Uniform: The second approach splits the privacy budget (ε1→l, δ1→l) uniformly across the query tree, resulting in equal privacy parameters for each operator. The benefit here is that every single operator receives part of the budget, so we do not lose out on performance gains from any operator. The drawback of the uniform approach, like the eager approach, is that some operators produce larger effects, so by not allocating more of the budget to them, we lose out on potential performance gains.

5.2 Optimal Privacy Budget Allocation

We design a third approach that uses an execution cost model as an objective function and applies convex optimization to determine the optimal budget splitting strategy.

Cost Model: The execution cost of a query in Algorithm 1 mainly consists of (a) the cost of securely computing each operator and (b) the I/O cost of Shrinkwrap handling each intermediate result. We model these costs as functions of the corresponding input and output data sizes. The cost of secure computation at operator oi is represented by a function cost_oi(Ni), where Ni is the set of the cardinalities of the input tables. The I/O cost of Shrinkwrap handling the intermediate result of operator oi is mainly the cost of sorting the exhaustively padded array of size n and the cost of copying that array into the new array of size n′, denoted by cost_s(n) and cost_c(n, n′) respectively. In particular, the size of the exhaustively padded array depends on the sizes of the input tables, represented by a function oi(Ni). Most of the time, the size of the exhaustively padded array is the product of the input table sizes. For example, if operator oi takes one input table of size n1, then oi(N = {n1}) = n1; if operator oi takes two input tables of sizes n1 and n2, then oi(N = {n1, n2}) = n1 · n2. The number of tuples n′ copied to the new array depends on the noisy output size ni.

The input tables to an operator can be a mix of base tables and intermediate output tables. Given a database of k relations (R1, . . . , Rk), there will be an additional l intermediate output tables generated in a query tree of l operators. The public knowledge K contains the table sizes of the original tables, but the output cardinalities of the operators are unknown. In the private setting, we cannot use the true cardinality for the estimation. Instead, Shrinkwrap uses the naive reduction factors from [47] to estimate the output cardinality of each operator, denoted by estimate(ci, K). The noise ηi added to the cardinality depends on the noise distribution of the truncated Laplace mechanism, and we use the expectation of the distribution TLap(εi, δi, ∆ci) to model the noise. Putting this together, the output cardinality ni at operator oi is estimated by estimate(ci, K) + E(TLap(εi, δi, ∆ci)).

With this formulation, the cost of assigning privacy budget P = {(εi, δi) | i = 1, . . . , l} to operators Q = {o1, . . . , ol} in the query tree is modeled as

C(P, K) = Σ_{i=1}^{l} [ cost_oi(Ni) + cost_s(oi(Ni)) + cost_c(oi(Ni), ni) ],    (5)

where ni = estimate(ci, K) + E(TLap(εi, δi, ∆ci)).

Optimization: Using this cost model, we find the optimal solution to the following problem and use it as the budget allocation strategy for Algorithm 1:

min_P C(P, K)  s.t.  Σ_{i=1}^{l} εi ≤ ε1→l,  Σ_{i=1}^{l} δi ≤ δ1→l,  εi, δi ≥ 0  ∀i = 1, . . . , l.    (6)

We show the detailed cost model, including the unit cost functions cost_oi(Ni), cost_s(oi(Ni)), and cost_c(oi(Ni), ni), for two different secure computation protocols in the next section. Using a convex optimization solver, we determine the optimal budget split that minimizes the cost objective function. We evaluate how it performs against the baseline approaches in Section 7.
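As a rough illustration of the optimization in Equation 6, the Python sketch below feeds a user-supplied cost function (standing in for Equation 5) to an off-the-shelf solver, with equality constraints mirroring Equation 3. This is a simplified stand-in for Shrinkwrap's convex optimization step: cost_fn, its packed-vector arguments, and the SLSQP choice are assumptions for illustration, not the system's implementation.

import numpy as np
from scipy.optimize import minimize

def optimal_budget_split(cost_fn, l, eps_total, delta_total):
    # Decision variables: per-operator (eps_i, delta_i), packed as one vector.
    x0 = np.concatenate([np.full(l, eps_total / l), np.full(l, delta_total / l)])
    objective = lambda x: cost_fn(x[:l], x[l:])  # models C(P, K), Eq. 5
    constraints = [
        {"type": "eq", "fun": lambda x: np.sum(x[:l]) - eps_total},    # sum eps_i  = eps_{1->l}
        {"type": "eq", "fun": lambda x: np.sum(x[l:]) - delta_total},  # sum delta_i = delta_{1->l}
    ]
    bounds = [(0.0, None)] * (2 * l)
    result = minimize(objective, x0, method="SLSQP", bounds=bounds, constraints=constraints)
    return list(zip(result.x[:l], result.x[l:]))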

6. PROTOCOL IMPLEMENTATION

Shrinkwrap is a flexible computation layer that supports a wide range of private data federations and database engines. The only requirements are that the private data federation process queries as operator trees and that the underlying database engine support SQL queries. Shrinkwrap implements a wide range of SQL operators for PDF queries, including selection, projection, aggregation, limit, and some window aggregates. For joins, we handle equi-joins and cross products. At this time we do not support outer joins or set operations. For output policy 2, where the client sees noisy answers released from differentially private mechanisms, we support aggregate queries like COUNT for the final operator.

The fundamental design pattern of secure computation protocols is the circuit model, where functions are represented as circuits, i.e., collections of topologically ordered gates. The benefit of this model is that by describing how to securely compute a single gate, we can compose the computation across all gates to evaluate any circuit, simplifying the design of the protocol. The downside is that circuits are difficult for programmers to reason about. Instead, the von Neumann-style Random Access Machine (RAM) model handles the impedance mismatch between programmers and circuits by creating data structures, such as ORAM, that utilize circuit-based oblivious algorithms for I/O. With the RAM model, programmers no longer have to write programs as circuits. Instead, they can think of their data as residing in secure arrays that can be plugged into their existing code. The downside of the RAM model is that since the data structures are more general purpose, performance lags behind direct circuit-model implementations of the same function. In this work, we use Shrinkwrap with both RAM model and circuit model protocols, providing a general framework that can easily add new protocols to the query executor.

6.1 RAM Model

When integrating a new protocol, we first create an operator cost model that captures the execution cost of each operator. For a RAM-based protocol, we model the cost of secure computation cost_o(N) for commonly used relational operators in Table 2, where N is the set of cardinalities of the input tables of operator o. The cost function considered in this work focuses on the I/O cost of each relational operator in terms of reads and writes from a secure array.

Table 2: Operator I/O cost

Operator             | cost_o(N = {n1, n2, . . .})
Filter               | n1 · c_read(n1) + n1 · c_write(n1)
Join                 | n1 · c_read(n1) + n1 · n2 · c_read(n2) + n1 · n2 · c_write(n1 · n2)
Aggregate            | n1 · c_read(n1) + c_write(n1)
Aggregate, Group By  | n1 · c_read(n1) + n1 · c_write(n1)
Window Aggregate     | n1 · c_read(n1) + n1 · c_write(n1)
Distinct             | n1 · c_read(n1) + n1 · c_write(n1)
Sort                 | n1 · log²(n1) · (c_read(n1) + c_write(n1))

We let c_read(n) and c_write(n) be the unit costs for reading and writing a tuple from a secure array of size n, respectively. Secure arrays protect their contents from attackers by guaranteeing that all reads or writes access the same number of memory locations. Typically, the data structures will shuffle either their entire contents or their access path on each access, with different implementations providing performance ranging from O(log n) to O(n log²(n)) per read or write [2, 53]. This variable I/O cost guarantees that an observer cannot look at memory access time or program traces to infer sensitive information.

Given the unit costs for reading and writing, we can represent the cost of secure computation for each operator as the sum of its read and write costs. For example, the cost function for a Filter operator, cost_Filter(N = {n1}), is the cost of reading n1 input records and writing n1 output records. For a Join operator over two input tables, the cost function cost_Join(N = {n1, n2}) equals the sum of n1 record reads from the first secure array of size n1, (n1 · n2) record reads from the second array of size n2, and (n1 · n2) record writes, i.e., n1 · c_read(n1) + n1 · n2 · c_read(n2) + n1 · n2 · c_write(n1 · n2). The other operators are modeled similarly in Table 2. The cost function for copying from an array of size n to an array of size n′ can be modeled as cost_c(n, n′) = n · c_read(n) + n′ · c_write(n′).

In this work, we use ObliVM [32] to implement our RAM-based relational operators, under a SMCQL [4] private data federation design. We chose ObliVM due to its compiler, which allows SMCQL to easily convert operators written as C-style code into secure computation programs. However, Shrinkwrap can support more recent ORAM implementations such as DORAM [12] by summarizing their cost behavior as in Table 2 and using those operator costs in our cost model. In SMCQL, private intermediate results are held in secure arrays, i.e., oblivious RAM (ORAM), and operators are compiled as circuits that receive an ORAM as input, evaluate the operator, and output the result ORAM to the next operator. With Shrinkwrap, we resize the intermediate ORAM arrays by obliviously eliminating excess dummy tuples, reducing the intermediate cardinalities of our results and improving performance. Any ORAM implementation is compatible with Shrinkwrap, as long as its I/O characteristics can be defined as in Table 2.

6.2 Circuit Model

Alternatively, if a private data federation uses a circuit-based protocol to express database operators, we can estimate the cost of an operator o with n_gates gates in its circuit, a maximum circuit depth of d_circuit, n_in elements of input, and an output size of n_out as

cost_o(N) = c_in · n_in + c_g · n_gates + c_d · d_circuit + c_out · n_out,

where c_in and c_out account for the encoding and decoding costs needed for a given protocol, c_g and c_d account for the costs due to the gate count and circuit depth respectively, the number of input elements is n_in = Σ_{nj ∈ N} nj, and the number of output elements is n_out = Π_{nj ∈ N} nj.

The generic nature of this model allows us to capture the performance features of all circuit-based secure computation protocols. For example, in this work we utilize the EMP toolkit [51], a state-of-the-art circuit protocol, in conjunction with Shrinkwrap. As EMP has not been used in private data federations, we implemented EMP with SMCQL. Instead of compiling operators into individual circuits, EMP compiles a SQL query into a single circuit, though it still pads intermediate results within the circuit. In this setting, we can still apply Shrinkwrap as we did in the RAM model, where we minimize the padding according to our differential privacy guarantees. Instead of modifying the ORAM, we directly modify the circuit to eliminate dummy tuples.

For both our RAM model and our circuit model implementations, Shrinkwrap uses a cost model to estimate the execution cost of intermediate result padding. With the model, we can carry out our privacy budget optimization and allocation as described in Section 5. Shrinkwrap's protocol-agnostic design allows it to optimize the performance of secure computation, regardless of the implementation.
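The per-operator costs of Table 2, the copy cost cost_c, and the circuit-model estimate of Section 6.2 translate directly into code. The sketch below uses a hypothetical ORAM unit cost that grows logarithmically with array size, purely to illustrate how the cost model composes; the unit cost and default coefficients are our assumptions.

import math

def unit_cost(n: float) -> float:
    # Hypothetical ORAM unit cost per read/write on an array of size n.
    return math.log2(n) if n > 1 else 1.0

def cost_filter(n1):
    # Table 2: read n1 input tuples, write n1 (padded) output tuples.
    return n1 * unit_cost(n1) + n1 * unit_cost(n1)

def cost_join(n1, n2):
    # Table 2: nested-loop style oblivious join over two secure arrays.
    return n1 * unit_cost(n1) + n1 * n2 * unit_cost(n2) + n1 * n2 * unit_cost(n1 * n2)

def cost_copy(n, n_prime):
    # cost_c(n, n'): bulk-copy from the padded array to the resized array.
    return n * unit_cost(n) + n_prime * unit_cost(n_prime)

def cost_circuit(n_in, n_gates, d_circuit, n_out, c_in=1.0, c_g=1.0, c_d=1.0, c_out=1.0):
    # Circuit-model estimate from Section 6.2.
    return c_in * n_in + c_g * n_gates + c_d * d_circuit + c_out * n_out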

6.3 M-Party Support

In this work, we implement Shrinkwrap using two-party secure computation protocols, meaning that we run our experiments over two data owners. However, Shrinkwrap supports additional data owners. By leveraging the associativity of our operators, we can convert all m-party computations into a series of binary computations. For example, assume we want to join two tables R and S that are horizontally partitioned across three data owners as R1, R2, R3 and S1, S2, S3. We can join across the three data owners as: (R1 ⋈ S1) ∪ (R1 ⋈ S2) ∪ (R1 ⋈ S3) ∪ (R2 ⋈ S1) ∪ (R2 ⋈ S2) ∪ (R2 ⋈ S3) ∪ (R3 ⋈ S1) ∪ (R3 ⋈ S2) ∪ (R3 ⋈ S3). We can execute this series of binary joins using two-party secure computation and then union the results. With this construction, we can scale Shrinkwrap out to any number of parties. However, this algorithm is not efficient, since it naively carries out m² secure computation operations for every m parties we compute over. We could use m-party secure computation protocols [3, 5, 20], which have fewer high-level operators and have made large performance strides, but these protocols are still expensive. We leave performance improvements in this setting for future work.

7. RESULTS

We now look at Shrinkwrap performance using both real-world and synthetic datasets and workloads. First, we cover our experimental design and setup. Next, we evaluate the end-to-end performance of our system and show the performance vs. privacy trade-off under both true and differentially private answer settings. Then, we look at how Shrinkwrap performs at the operator level and the accuracy of our cost model, followed by a discussion of our budget splitting strategies. Finally, we examine how Shrinkwrap scales to larger datasets and more complex queries.

7.1 Experimental Setup

For this work, we implemented Shrinkwrap on top of an existing healthcare database federation that uses SMCQL [4]. This private data federation serves a group of hospitals that wish to analyze the union of their electronic health record systems for research while keeping individual tuples private.

A clinical data research network (CDRN) is a consortium of healthcare sites that agree to share their data for research. CDRN data providers wish to keep their data private. We examine this work in the context of HealthLNK [44], a CDRN for Chicago-area healthcare sites. This repository contains records from seven Chicago-area healthcare institutions, each with their own member hospitals, from 2006 to 2012, totaling about 6 million records. The data set is selected from a diverse set of hospitals, including academic medical centers, large county hospitals, and local community health centers. Medical researchers working on HealthLNK develop SQL queries and run them on data from a set of healthcare sites. These researchers are the clients of the private data federation, and the healthcare sites are the private data owners.

Dataset: We evaluate Shrinkwrap on medical data from two Chicago-area hospitals in the HealthLNK data repository [44] over one year of data. This dataset has 500,000 patient records, or 15 GB of data. For additional evaluation, we generate synthetic data up to 750 GB based on this source medical data. To simplify our experiments, we use a public patient registry for common diseases that maintains a list of anonymized patient identifiers associated with these conditions. We filter our query inputs using this registry.

Query Workload: For our experiments, we chose three representative queries based on clinical data research protocols [23, 43] and evaluate Shrinkwrap on this workload using de-identified medical records from the HealthLNK data repository. We also generate synthetic versions of Aspirin Count that contain additional join operators. We refer to these queries by the number of join operators, e.g., 3-Join for 3 join operators. The queries are shown in Table 3.

Table 3: HealthLNK query workload

Dosage Study: SELECT DISTINCT d.pid FROM diagnoses d, medications m WHERE d.pid = m.pid AND medication = 'aspirin' AND icd9 = 'circulatory disorder' AND dosage = '325mg'
Comorbidity: SELECT diag, COUNT(*) cnt FROM diagnoses WHERE pid ∈ cdiff cohort ∧ diag <> 'cdiff' ORDER BY cnt DESC LIMIT 10;
Aspirin Count: SELECT COUNT(DISTINCT pid) FROM diagnosis d JOIN medication m ON d.pid = m.pid JOIN demographics demo ON d.pid = demo.pid WHERE d.diag = 'heart disease' AND m.med = 'aspirin' AND d.time <= m.time;
3-Join: SELECT COUNT(DISTINCT pid) FROM diagnosis d JOIN medication m ON d.pid = m.pid JOIN demographics demo ON d.pid = demo.pid JOIN demographics demo2 ON d.pid = demo2.pid WHERE d.diag = 'heart disease' AND m.med = 'aspirin' AND d.time <= m.time;

Configuration: Shrinkwrap query processing runs on top of PostgreSQL 9.5 running on Ubuntu Linux. We evaluated our two-party prototype on 6 servers running in pairs. The servers each have 64 GB of memory and 7200 RPM NL-SAS hard drives, and they are on a dedicated 10 Gb/s network. Our results report the average runtime of three runs per experiment. We implement Shrinkwrap under both the RAM model (using ObliVM [33]) and the circuit model (using EMP [51]). For most experiments, we show results using the RAM model, though corresponding circuit model results are similar. Unless otherwise specified, the results show the end-to-end runtime of a query with output policy 1, i.e., true results.

7.2 End-to-end Performance

In our first experiment, we look at the end-to-end performance of Shrinkwrap compared to baseline private data federation execution. For execution, we set ε to 0.5, δ to 0.00005, and use the optimal budget splitting approach.

In Figure 5, we look at the overall performance of four queries under Shrinkwrap for both RAM model and circuit model secure computation. For Comorbidity, the execution does not contain any join operators, meaning that there is no explosion in intermediate result cardinality, so Shrinkwrap provides fewer benefits. Dosage Study contains a join operator with a parent distinct operator. Applying differential privacy to the output of the join improves performance by close to 5x under the RAM model, but it sees fewer benefits under the circuit model. Aspirin Count contains two joins and sees an order of magnitude improvement under the RAM model and a 2x improvement under the circuit model. Finally, 3-Join has an enormous cardinality blow-up, allowing Shrinkwrap to improve its performance by 33x under the RAM model and 35x under the circuit model. In Section 7.6, we examine this effect of join operators on performance. Although the implementation in the circuit model outperforms the RAM model by an order of magnitude, Shrinkwrap gives similar efficiency improvements in both systems.

7.3 Privacy, Performance, and Accuracy

We consider the performance vs. privacy trade-off provided by Shrinkwrap and evaluate the effect of the execution privacy budget on query performance.

[Figure 5: End-to-end Shrinkwrap performance for all queries at ε = 0.5, δ = .00005. Execution time (s) with and without Shrinkwrap for Comorbidity, Dosage Study, Aspirin Count, and 3-Join; (a) RAM model (ObliVM), (b) circuit model (EMP).]

For this experiment, we use the 3-Join query with the optimal budget splitting approach, fix the total privacy budget to ε = 1.5 and δ = 0.00005, and use output setting 2, i.e., differentially private final answers. We vary the privacy budget for performance ε1→l from 0.1 to 1.5 and fix δ1→l = δ. The remaining privacy budget ε − ε1→l is spent on the query output using a Laplace mechanism.

In Figure 6a, we see the execution time as a function of the privacy budget for performance (ε1→l). As the privacy budget for performance ε1→l increases, the execution time decreases. A lower budget means that more noise must be present in the intermediate result cardinalities, which translates to larger cardinalities and higher execution time variance. We know that I/O cost dominates the Shrinkwrap cost model, so larger cardinalities mean more I/O accesses, which causes lower performance.

In Figure 6b, we show the accuracy versus privacy trade-off by varying the privacy budget for performance ε1→l. We report the execution time and the error introduced to the final query answer at various privacy budgets for performance ε1→l < 1.5. Out of a total possible output size of 5500 tuples, the query output is 10/5500 = 0.18% of the total possible output size. If we allow additive noise of 0.36% (20/5500), we get a 100x performance improvement. Since the total privacy ε is fixed, the more we spend for performance (larger ε1→l), the less privacy budget we have at the output, i.e., as our execution time improves, our output noise increases.

[Figure 6: Privacy, performance, and accuracy trade-offs, computed using different levels of privacy budget for performance ε1→l ∈ {0.1, . . . , 1.5}. Executed using the RAM model, 3-Join, ε = 1.5, δ = 0.00005; (a) average execution time vs. ε1→l, (b) average execution time vs. final result error (# of tuples).]

7.4 Evaluating Budget Splitting Strategies

We first examine the relative performance of the three budget splitting strategies introduced in Section 5: uniform, eager, and optimal. Recall that uniform splits the budget evenly across all the operators, eager inserts the entire budget at the first operator, and optimal uses the Shrinkwrap cost model, along with cardinality estimates, to identify an optimal budget split. We also include an oracle approach that shows the performance of Shrinkwrap if true cardinalities are used in the cost model to split the budget instead of cardinality estimates. Note that oracle does not satisfy differential privacy, but it gives an upper bound on the best performance achievable through privacy budgeting. For this experiment, we use the Aspirin Count and 3-Join queries, since they contain multiple operators where Shrinkwrap generates differentially private cardinalities, and we set the privacy parameters usable during execution to ε = 0.5 and δ = .00005.

Figure 7 displays the relative speedup of all three approaches and the oracle for both queries over the baseline, fully padded, private data federation execution. All three Shrinkwrap approaches provide significant performance improvements, ranging from 3x to 33x. As expected, optimal performs best for both queries. We also see that eager performs better for Aspirin Count, while uniform performs better for 3-Join.

[Figure 7: Shrinkwrap execution speedup over baseline using different budget strategies (uniform, eager, optimal, and oracle) for Aspirin Count and 3-Join. Executed using the RAM model, ε = 0.5, δ = .00005; annotated speedups over full padding range from 3x–8x on Aspirin Count to 25x–34x on 3-Join.]

The benefits of minimized intermediate result cardinalities cascade as those results flow through the operator tree. By using the entire budget on the first join operator in Aspirin Count, the eager approach maximizes the effect of the intermediate result cardinality reduction and, as a result, outperforms the uniform approach.

Figure 7: Shrinkwrap execution speedup over baseline using different budget strategies. Executed using RAM model, ε = 0.5, δ = .00005. (Bar chart: speedup over full padding, roughly 3x to 34x, for Uniform, Eager, Optimal, and Oracle on Aspirin Count and 3-Join.)

Figure 8: Per operator performance using different budget strategies. Executed using Aspirin Count, RAM model, ε = 0.5, δ = .00005. (Bar chart: average execution time in seconds per operator (Join M., Join D., Distinct, Total) for Baseline, Uniform, Eager, Optimal, and Oracle.)

The benefits of minimized intermediate result cardinalities cascade as those results flow through the operator tree. By using the entire budget on the first join operator in Aspirin Count, the eager approach maximizes the effect of the intermediate result cardinality reduction and, as a result, outperforms the uniform approach. However, for the 3-Join query, the presence of the additional join operator overrides the cascading cardinality benefit of the first join operator. The uniform approach outperforms the eager approach by ensuring that all three of the join operators receive differentially private cardinalities. The optimal approach outperforms both eager and uniform by combining the best of both worlds. Optimal applies differential privacy to all operators, like uniform, but it uses a larger fraction of the budget on earlier operators, like eager.

Looking at the oracle approach, we can evaluate the accuracy of our cost model. For privacy reasons, our cost model does not use the true cardinalities. Here, we see that the optimal approach, which uses estimated cardinalities, and the oracle approach, which uses true cardinalities, provide similar performance. In experiments using Aspirin Count, we calculate the correlation coefficient between the true execution time and the estimated execution time based on our cost model as .998 for the circuit model and .931 for the RAM model, given ε = 0.5 and δ = .00005. We see that the Shrinkwrap cost model provides a reasonably accurate prediction of the true cost.

7.5 Operator Breakdown

Now, we look at the execution time by operator to see where Shrinkwrap provides the largest impact. We include only private operators in the figure. For this experiment, we use the Aspirin Count query with ε = 0.5 and δ = 0.00005. We show the execution times for the baseline, fully padded approach for intermediate results, as well as the uniform, eager, and optimal budget splitting approaches.

Figure 8 shows the execution times of each operator in the Aspirin Count query. Note that Shrinkwrap does not apply any differential privacy until the output of Join M. (Join on Medications), so we see no speed-up in the actual Join M. operator, only in the later query operators.

The first operator to see a benefit from Shrinkwrap is Join D. (Join on Demographics), the second join in the query. Shrinkwrap generates a differentially private cardinality for the output of the previous join, Join M., according to the selected budget splitting approach. For uniform, the allocated budget is not large enough to generate a differentially private cardinality that is smaller than the fully padded cardinality. In fact, the overhead of calculating the differentially private cardinality actually causes the operator runtime to rise above the baseline runtime. For eager, Shrinkwrap spends the entire budget on the output of the first join and reduces the intermediate cardinality by an order of magnitude. The optimal approach uses enough budget to reduce the intermediate cardinality, but not as much as eager.

The next operator, DISTINCT, also sees a large performance benefit. All three budget splitting approaches reduce the runtime for DISTINCT by about an order of magnitude. Since the baseline intermediate cardinality for DISTINCT is much larger, the uniform approach generates a differentially private cardinality that reduces the intermediate result size and improves performance. The eager approach does not have any additional budget to use, but the effect of the previous intermediate cardinality cascades through the query tree and reduces the execution time for DISTINCT as well. Finally, the optimal approach applies its remaining budget to see a significant performance improvement.

In the Aspirin Count query, the three budget splitting approaches allocate the budget between three operators and provide insight into the trade-off between early and late budget allocation. We see widely varying execution times for each of the operators, with different operators contributing the bulk of the performance cost depending on the budget strategy. Here, the optimal strategy gives the best performance. The substantial variance in execution time across these operators demonstrates the value of the added accuracy that our I/O cost model provides.
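Whether an operator sees any benefit comes down to how far its differentially private cardinality sits below the fully padded worst case. The following sketch is a rough illustration under assumed numbers and a simple shifted-Laplace padding rule, not Shrinkwrap's exact truncated mechanism: it shows how the padded size approaches the true cardinality as an operator's share of ε grows, and degenerates to the fully padded size when the share is tiny.

import math, random

# Hypothetical numbers and a simplified padding rule, for illustration only.

def dp_cardinality(true_card, padded_card, eps, delta, rng):
    # Shift the noise so the result is (almost) never below the true
    # cardinality, keeping query answers exact, then clamp to the
    # fully padded size, which is always a safe fallback.
    shift = math.ceil((1.0 / eps) * math.log(1.0 / (2.0 * delta)))
    noise = round(rng.expovariate(eps) - rng.expovariate(eps))  # Laplace with scale 1/eps
    return min(max(true_card + shift + noise, true_card), padded_card)

rng = random.Random(7)
true_card, padded_card = 400, 2_000            # assumed intermediate size vs. full padding
for eps_op in (0.005, 0.17, 0.5):              # tiny share, uniform-like share, whole budget
    sizes = [dp_cardinality(true_card, padded_card, eps_op, 5e-5, rng) for _ in range(4)]
    print(eps_op, sizes)

With ε = 0.005, the shifted noise pushes the estimate past the fully padded size, so the operator pads as if Shrinkwrap were absent; with larger shares the padded size hugs the true cardinality, which is the cascading reduction described above.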
7.6 Join Scale Up

We now look at how Shrinkwrap scales with more complex workloads. From our experiments, we know that the number of joins in a query has an extremely large impact on execution time. More complex workloads typically contain more nested statements and require more advanced SQL processing, but they can be broken down into a series of simpler statements. This work focuses on complexity as a function of the join count in order to target the largest performance bottleneck in a single SQL statement.

In our experiment, we scale the join count and measure the resulting performance. Since Shrinkwrap applies differential privacy at the operator outputs, we measure the performance impact of join operators by looking at the execution time of their immediate parent operator. For consistency, we truncate the number of non-inherited input tuples of each join to equal magnitudes and use ε = 0.5 and δ = 0.00005 for all runs.

Figure 9: Aspirin Count with synthetic join scaling. Executed using RAM model, ε = 0.5, δ = .00005. (Execution time in seconds for Baseline and Shrinkwrap at 1, 2, and 3 joins.)

From Figure 9, we see how execution time scales as a function of the join count. As the number of joins increases, the execution time also increases. To find the source of the performance slowdown, we can look at the Shrinkwrap cost model. For each join operator, the baseline private data federation fully pads the output. If the join inputs were both of size n, then the output would be n². As the number of joins increases, the cardinality rises from n to n² to n³. The exponential increase in output cardinality requires significantly more I/O accesses, which causes the large slowdown in performance. Shrinkwrap applies differential privacy to reduce the output size at each join, reducing the magnitude of each join output. While Shrinkwrap still sees significant growth in execution time as a function of the join count, the overall performance is orders of magnitude better.
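The growth argument above can be checked with back-of-the-envelope arithmetic. The sketch below uses an assumed input size, join selectivity, and a fixed noise overhead standing in for the differentially private cardinality; it is meant only to show the n versus n^k trend that drives the I/O cost, not to reproduce measured numbers.

# Assumed sizes and selectivity, for illustration of the trend only.

def padded_chain(n, joins):
    # Baseline: each join output is fully padded to the product of its inputs.
    out = n
    for _ in range(joins):
        out *= n                      # n -> n^2 -> n^3 -> ...
    return out

def shrinkwrapped_chain(n, joins, selectivity, dp_overhead):
    # Shrinkwrap-style: each intermediate is padded only to a noisy estimate
    # of its true size (true size plus an assumed differential privacy overhead).
    out = n
    for _ in range(joins):
        true_out = int(out * n * selectivity)
        out = true_out + dp_overhead
    return out

n = 1_000
for k in (1, 2, 3):
    print(k, "joins:", padded_chain(n, k), "padded tuples vs.",
          shrinkwrapped_chain(n, k, selectivity=0.001, dp_overhead=200), "with Shrinkwrap")

Since the cost model charges I/O roughly in proportion to these cardinalities, the fully padded plan pays for n^k tuples at the k-th join while Shrinkwrap pays close to the true size plus noise, which is why the gap in Figure 9 widens with the join count.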
7.7 Data Size Scale Up

We evaluate the performance of Shrinkwrap with increasing data sizes. We leverage the higher efficiency of our circuit model protocol, EMP [51], to examine the effect of Shrinkwrap on larger input data sizes. Without Shrinkwrap, the RAM model protocol, ObliVM [33], needs 1.3 hours to run the Aspirin Count query and, at 50x data scale, 65 hours to complete. Instead, we use EMP, which completes in 15 minutes.

Figure 10 shows the execution time with and without Shrinkwrap using an implementation of EMP [51] on our Aspirin Count reference query. We generated synthetic data based on the original tables, giving us an effective maximum data size of 750 GB. In Figure 10, we see that Shrinkwrap provides a significant improvement during execution. As the data size grows, our performance improvement grows as well, reflecting the power and flexibility of Shrinkwrap.

Figure 10: Aspirin Count with synthetic data scaling. Executed using Circuit model, ε = 0.5, δ = .00005. (Execution time in seconds, with and without Shrinkwrap, at synthetic data sizes from 1x to 50x above baseline.)

8. RELATED WORK

Within the literature, different approaches exist to improve query processing in private data federations, such as databases based on homomorphic encryption, TEEs, differential privacy, and cloud computation [1, 29, 30, 36, 45, 54]. Shrinkwrap, on the other hand, provides general-purpose, hardware-agnostic, in-situ SQL evaluation with provable privacy guarantees and exact results.

PINQ [36] introduced the first database with differential privacy, along with privacy budget tracking and sensitivity analysis for operator composition. We extend this work by applying its privacy calculus to private data federations. Follow-on work in differential privacy appears in DJoin [38], where the system supports private execution of certain SQL operators over a federated database with strong privacy guarantees and noisy results. Shrinkwrap supports a larger set of database operators and, instead of using noisy results to safeguard data, uses noisy cardinalities to improve performance.

He et al. [22] applied computational differential privacy to join operators for private record linkage and proposed a three-desiderata approach to operator execution: precision, provability, and performance. Shrinkwrap incorporates this style of join execution and its approach to execution trade-offs.

Pappas et al. [42] showed that, by trading small bits of privacy for performance within provable bounds using Bloom filters, systems can provide scalable DBMS queries over arbitrary Boolean queries. Shrinkwrap applies this pattern of provable privacy versus performance trade-offs to the larger set of arbitrary, non-Boolean SQL queries.

Both Opaque [56] and Hermetic [54] use cost models to estimate performance slowdowns due to privacy-preserving computation. In both cases, they use secure enclaves to carry out private computation, which provides constant-cost I/O. As such, their cost models cannot account for the variable-cost I/O present in Shrinkwrap.

9. CONCLUSIONS AND FUTURE WORK

In this work, we introduce Shrinkwrap, a protocol-agnostic system for differentially private query processing that significantly improves the performance of private data federations. We use a computational differential privacy mechanism to selectively leak statistical information within provable privacy bounds and reduce the size of intermediate results. We introduce a novel cost model and privacy budget optimizer to maximize the privacy, performance, and accuracy trade-off. We integrate Shrinkwrap into an existing private data federation architecture and collect results using real-world medical data.

In future research, we hope to further improve the performance of private data federations and the efficiency of Shrinkwrap. We are currently investigating novel secure array algorithms and data structures to improve I/O access time, privacy budget optimizations over multiple queries, and extensions of private data federations using additional secure computation protocols and cryptographic primitives.

Acknowledgments: We thank Abel Kho, Satyender Goel, Katie Jackson, and Jess Joseph Behrens for their guidance and assistance with CAPriCORN and HealthLNK data. We also thank the HealthLNK team for sharing de-identified electronic health record data for this study.

10. REFERENCES
[1] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, and R. Venkatesan. Orthogonal Security with Cipherbase. CIDR, 2013.
[2] A. Arasu and R. Kaushik. Oblivious Query Processing. ICDT, pages 26–37, 2014.
[3] G. Asharov and Y. Lindell. The BGW protocol for perfectly-secure multiparty computation. Cryptology and Information Security Series, 10(189):120–167, 2013.
[4] J. Bater, G. Elliott, C. Eggen, S. Goel, A. Kho, and J. Rogers. SMCQL: Secure Querying for Federated Databases. PVLDB, 10(6):673–684, 2017.
[5] D. Beaver, S. Micali, and P. Rogaway. The Round Complexity of Secure Protocols. ACM Symposium on Theory of Computing (STOC), pages 503–513, 1990.
[6] M. Bellare, V. T. Hoang, S. Keelveedhi, and P. Rogaway. Efficient garbling from a fixed-key blockcipher. In S&P, pages 478–492. IEEE, 2013.
[7] V. Bindschaedler, P. Grubbs, D. Cash, T. Ristenpart, and V. Shmatikov. The tao of inference in privacy-protected databases. PVLDB, 11(11):1715–1728, 2018.
[8] D. Bogdanov, L. Kamm, S. Laur, P. Pruulmann-Vengerfeldt, R. Talviste, and J. Willemson. Privacy-preserving statistical data analysis on federated databases. In Annual Privacy Forum, pages 30–55. Springer, 2014.
[9] G. Chen, S. Chen, Y. Xiao, Y. Zhang, Z. Lin, and T. H. Lai. SgxPectre Attacks: Leaking Enclave Secrets via Speculative Execution. 2018.
[10] S. S. Chow, J.-H. Lee, and L. Subramanian. Two-party computation model for privacy-preserving queries over distributed databases. In NDSS, 2009.
[11] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu. Tools for Privacy Preserving Distributed Data Mining. ACM SIGKDD Explorations, 4(2), 2002.
[12] J. Doerner and A. Shelat. Scaling ORAM for Secure Computation. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS '17, pages 523–535, 2017.
[13] C. Dwork. Differential privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, pages 1–12, 2006.
[14] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 2014.
[15] K. E. Emam. Heuristics for de-identifying health data. IEEE Security & Privacy, 6(4):58–61, 2008.
[16] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
[17] A. Ferraiuolo, R. Xu, D. Zhang, A. C. Myers, and G. E. Suh. Verification of a Practical Hardware Security Architecture Through Static Information Flow Analysis. ACM SIGARCH Computer Architecture News, 45(1):555–568, 2017.
[18] O. Goldreich. Towards a theory of software protection and simulation by oblivious rams. In STOC, pages 182–194, New York, NY, USA, 1987. ACM.
[19] O. Goldreich. Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press, New York, NY, USA, 2004.
[20] O. Goldreich, S. Micali, and A. Wigderson. How to Play Any Mental Game. STOC '87, pages 218–229, 1987.
[21] M. T. Goodrich and M. Mitzenmacher. Privacy-preserving access of outsourced data via oblivious RAM simulation. In ICALP, 2011.
[22] X. He, A. Machanavajjhala, C. Flynn, and D. Srivastava. Composing Differential Privacy and Secure Computation. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS '17, number 1, pages 1389–1406, New York, New York, USA, 2017.
[23] A. F. Hernandez, R. L. Fleurence, and R. L. Rothman. The ADAPTABLE Trial and PCORnet: shining light on a new research paradigm. Annals of Internal Medicine, 163(8):635–636, 2015.
[24] Y. Huang, D. Evans, J. Katz, and L. Malka. Faster secure two-party computation using garbled circuits. In USENIX Security 2011, 2011.
[25] Y. Ishai, J. Kilian, K. Nissim, and E. Petrank. Extending oblivious transfers efficiently. In CRYPTO, LNCS, pages 145–161, 2003.
[26] M. S. Islam, M. Kuzu, and M. Kantarcioglu. Access pattern disclosure on searchable encryption: Ramification, attack and mitigation. In Network and Distributed System Security Symposium (NDSS), 2012.
[27] N. Johnson, J. P. Near, and D. Song. Towards Practical Differential Privacy for SQL Queries. PVLDB, 11(5):526–539, 2018.
[28] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom. Spectre Attacks: Exploiting Speculative Execution. 2018.
[29] S. Laur, R. Talviste, and J. Willemson. From oblivious AES to efficient and secure database join in the multiparty setting. Lecture Notes in Computer Science, 7954 LNCS:84–101, 2013.
[30] S. Laur, J. Willemson, and B. Zhang. Round-efficient Oblivious Database Manipulation. ISC'11, pages 262–277, 2011.
[31] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, and S. Mangard. ARMageddon: Cache Attacks on Mobile Devices. USENIX Security, pages 549–564, 2016.
[32] C. Liu, Y. Huang, E. Shi, J. Katz, and M. Hicks. Automating Efficient RAM-Model Secure Computation. 2014 IEEE Symposium on Security and Privacy, pages 623–638, 2014.
[33] C. Liu, X. S. Wang, K. Nayak, Y. Huang, and E. Shi. ObliVM: A Programming Framework for Secure Computation. Oakland, pages 359–376, 2015.
[34] A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008.
[35] R. McKenna, G. Miklau, M. Hay, and A. Machanavajjhala. Optimizing error of high-dimensional statistical queries under differential privacy. PVLDB, 11(10):1206–1219, 2018.
[36] F. D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In 2009 ACM SIGMOD, pages 19–30. ACM, 2009.
[37] I. Mironov, O. Pandey, O. Reingold, and S. Vadhan. Computational differential privacy. In CRYPTO, 2009.
[38] A. Narayan and A. Haeberlen. DJoin: Differentially private join queries over distributed databases. Proceedings of the 10th USENIX Symposium, page 14, 2012.
[39] M. Naveed, S. Kamara, and C. V. Wright. Inference Attacks on Property-Preserving Encrypted Databases. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 644–655. ACM, 2015.
[40] K. Nayak, X. S. Wang, S. Ioannidis, U. Weinsberg, N. Taft, and E. Shi. GraphSC: Parallel Secure Computation Made Easy.
[41] V. Nikolaenko, S. Ioannidis, U. Weinsberg, M. Joye, N. Taft, and D. Boneh. Privacy-preserving matrix factorization. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 801–812. ACM, 2013.
[42] V. Pappas, F. Krell, B. Vo, V. Kolesnikov, T. Malkin, S. G. Choi, W. George, A. Keromytis, and S. Bellovin. Blind seer: A scalable private DBMS. Proceedings - IEEE Symposium on Security and Privacy, pages 359–374, 2014.
[43] PCORI. Characterizing the Effects of Recurrent Clostridium Difficile Infection on Patients. IRB Protocol, ORA: 14122, 2015.
[44] PCORI. Exchanging de-identified data between hospitals for city-wide health analysis in the Chicago Area HealthLNK data repository (HDR). IRB Protocol, 2015.
[45] R. Popa and C. Redfield. CryptDB: protecting confidentiality with encrypted query processing. SOSP, pages 85–100, 2011.
[46] J. Saia and M. Zamani. Recent Results in Scalable Multi-Party Computation. 2014.
[47] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 23–34. ACM, 1979.
[48] S. Vadhan. The Complexity of Differential Privacy. Springer International Publishing, 2017.
[49] J. Van Bulck, M. Minkin, O. Weisse, D. Genkin, B. Kasikci, F. Piessens, M. Silberstein, T. F. Wenisch, Y. Yarom, R. Strackx, and K. Leuven. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. Proceedings of the 27th USENIX Security Symposium, pages 991–1008, 2018.
[50] N. Volgushev, M. Schwarzkopf, B. Getchell, A. Lapets, M. Varia, and A. Bestavros. Conclave Workflow Manager for MPC, 2018.

[51] X. Wang, A. J. Malozemoff, and J. Katz. EMP-Toolkit: Efficient Multiparty Computation Toolkit. https://github.com/emp-toolkit, 2016.
[52] X. Wang, A. J. Malozemoff, and J. Katz. Faster secure two-party computation in the single-execution setting. Lecture Notes in Computer Science, 10212 LNCS:399–424, 2017.
[53] X. S. Wang, K. Nayak, C. Liu, T.-H. H. Chan, E. Shi, E. Stefanov, and Y. Huang. Oblivious Data Structures. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security - CCS '14, pages 215–226, 2014.
[54] M. Xu, A. Papadimitriou, A. Feldman, and A. Haeberlen. Using Differential Privacy to Efficiently Mitigate Side Channels in Distributed Analytics. In Proceedings of the 11th European Workshop on Systems Security - EuroSec'18, pages 1–6, New York, New York, USA, 2018. ACM Press.
[55] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual Symposium on Foundations of Computer Science, SFCS '82, pages 160–164, Washington, DC, USA, 1982. IEEE Computer Society.
[56] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E. Gonzalez, and I. Stoica. Opaque: An oblivious and encrypted distributed analytics platform. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI'17, pages 283–298, Berkeley, CA, USA, 2017. USENIX Association.
