Efficient SQL Query Processing in Differentially Private Data Federations

Shrinkwrap: Efficient SQL Query Processing in Differentially Private Data Federations Johes Bater Xi He William Ehrich Northwestern University Duke University Northwestern University Ashwin Machanavajjhala Jennie Rogers Duke University Northwestern University ABSTRACT to any party involved in the query without the assistance of A private data federation is a set of autonomous databases a trusted data curator. that share a unified query interface offering in-situ evalua- A private data federation must provably ensure the fol- tion of SQL queries over the union of the sensitive data of lowing guarantee: a data owner should not be able to re- its members. Owing to privacy concerns, these systems do construct the database (or a part of it) held by other data not have a trusted data collector that can see all their data owners based on the intermediate result sizes of the query and their member databases cannot learn about individual evaluation. One way to achieve this without a trusted data records of other engines. Federations currently achieve this curator is to use secure computation protocols, which pro- goal by evaluating queries obliviously using secure multi- vide a strong privacy guarantee that intermediate results party computation. This hides the intermediate result car- leak no information about their inputs. dinality of each query operator by exhaustively padding it. Current implementations of private data federations that With cascades of such operators, this padding accumulates use secure computation have a performance problem: state- to a blow-up in the output size of each operator and a pro- of-the-art systems have slowdowns of 4{6 orders of magni- portional loss in query performance. Hence, existing private tude over non-private data federations [4, 56]. The source data federations do not scale well to complex SQL queries of this slowdown lies in the secure computation protocols over large datasets. used by private data federations that permit the secure eval- We introduce Shrinkwrap, a private data federation that uation of a SQL query without the need of trusted third offers data owners a differentially private view of the data party. A SQL query is executed as a directed acyclic graph held by others to improve their performance over oblivious of database operators. Each operator within the tree takes query processing. Shrinkwrap uses computational differen- one or two inputs, applies a function, and outputs its result tial privacy to minimize the padding of intermediate query to the next operator. In a typical (distributed) database results, achieving up to 35X performance improvement over engine, the execution time needed to calculate each inter- oblivious query processing. When the query needs differ- mediate result and the size of that result leaks information entially private output, Shrinkwrap provides a trade-off be- about the underlying data to curious data owners partici- tween result accuracy and query evaluation performance. pating in the secure computation. To address this leakage, private data federations insert dummy tuples to pad intermediate results to their maximum possible size thereby en- suring that execution time is independent of the input data. 1. INTRODUCTION With secure evaluation, query execution effectively runs in The storage and analysis of data has seen dramatic growth the worst-case, drastically increasing its computation costs in recent years. Organizations have never valued data more as intermediate result sizes grow. While performance is rea- arXiv:1810.01816v1 [cs.DB] 3 Oct 2018 highly. Unfortunately, this newfound value has attracted sonable for simple queries and small workloads, performance unwanted attention. Security and privacy breaches litter the is untenable for complex SQL queries with multiple joins. news headlines, and have engendered fear and a hesitance to Several approaches attempt to solve this performance prob- share data, even among trusted collaborators. Without data lem, but they fail to provably bound the information leaked sharing, information becomes siloed and enormous potential to a data owner. One line of research uses Trusted Exe- analytical value is lost. cution Environments (TEEs) that evaluate relational oper- Recent work in databases and cryptography attempts to ators within an on-chip secure hardware enclave [54, 56]. solve the data sharing problem by introducing the private TEEs efficiently protect query execution, but they require data federation [4]. A private data federation consists of specialized hardware from chip manufacturers. Moreover, a set of data owners who support a common relational current TEE implementations from both Intel and AMD database schema. Each party holds a horizontal partition do not adequately obscure computation, allowing observers (i.e., a subset of rows) of each table in the database. A pri- to obtain supposedly secure data through widely publicized vate data federation provides a SQL query interface to ana- hardware vulnerabilities [9, 17, 28, 31, 49]. Another ap- lysts (clients) over the union of the records held by the data proach selectively applies homomorphic encryption to evalu- owners. Query evaluation is performed securely from multi- ate relational operators in query trees, while keeping the un- ple data owners without revealing unauthorized information derlying tuples encrypted throughout the computation (e.g., 1 CryptDB [45]). While this system improves performance, it leaks too much information about the data, such as statistics Private Data and memory access patterns, allowing a curious observer to End User Owner deduce information about the true values of encrypted tu- SQL output of q ples [15, 26, 39, 7]. query q over all DBs In this work, we bridge the gap between provable secure secure query plan q systems with untenable performance and practical systems Query with no provable guarantees on leakage using differential Coordinator secret shares of q’s privacy [14]. Differential privacy is a state-of-the-art tech- output (1/data owner) nique to ensure privacy, and provides a provable guarantee Secure Computation that one can not reconstruct records in a database based on outputs of a differentially private algorithm. Differen- Figure 1: Private data federation architecture tially private algorithms, nevertheless, permit approximate aggregate statistics about the dataset to be revealed. We present Shrinkwrap, a system that improves private situ evaluation of SQL queries over the union of the sensitive data federation performance by carefully relaxing the pri- data of its members without revealing unauthorized infor- vacy guaranteed for data owners in terms of differential pri- mation to any party involved in the query. A private data vacy. Instead of exhaustively padding intermediate results federation has three types of parties: 1) two or more data to their worst-case sizes, Shrinkwrap obliviously eliminates owners DO1;:::; DOm that hold private data D1;:::;Dm dummy tuples according to tunable privacy parameters, re- respectively, and where all Di share a public schema of k re- ducing each intermediate cardinality to a new, differentially lations (R1;:::;Rk); 2) a query coordinator that plans and private value. The differentially private intermediate re- orchestrates SQL queries q on the data of the data owners; sult sizes are close to the true sizes, and thus, Shrinkwrap and 3) a client that writes SQL statements to the query co- ¯ achieves practical query performance. ordinator. The set of private data D = (D1;:::;Dm) owned To the best of our knowledge, Shrinkwrap is the first sys- by the data owners is a horizontal partition of every table tem for private data sharing that combines differential pri- in the total data set D. vacy with secure computation for query performance opti- As shown in Figure 1, the client first passes an SQL state- mization. The main technical contributions in this work are: ment q(·) to the query coordinator, who compiles the query • A query processing engine that offers controlled informa- into an optimized secure query plan to be executed by each tion leakage with differential privacy guarantees to speed data owner. Once the data owners finish execution, they up private data federation query processing and provides each pass their secret share of the output to the query coor- tunable privacy parameters to control the trade-off be- dinator, who assembles and returns the result to the client. tween privacy, accuracy, and performance 2.1 Privacy Goals and Assumptions • A computational differential privacy mechanism that se- Table 1 summarizes the privacy goals for all parties in- curely executes relational operators, while minimizing involved in this process and the necessary assumptions re- termediate result padding of operator outputs quired for achieving these goals. • A novel algorithm that optimally allocates, tracks, and Privacy Goals: Data owners require a privacy guarantee applies differential privacy across query execution that the other parties (data owners, query coordinator and • A protocol-agnostic cost model that approximates the the client) are not able to reconstruct the private data they large, and non-linear, computation overhead of secure hold based on intermediate results of the query execution. computation as it cascades up an operator tree In particular, we require data owners to have a differentially The rest of this paper is organized as follows. In Sec- private view over the inputs of other data owners (formally tion 2 we define private data federations, outline our privacy defined in Def. 2). We aim to support two policies for clients. goals and formally define secure computation and differen- Clients may be trusted (output policy 1) and allowed to tial privacy. Section 3 describes the problem addressed by see true answers to queries but must not learn any other Shrinkwrap. Our end-to-end solution, Shrinkwrap, and its information about the private inputs held by data owners. privacy guarantees, is described in Section 4. We show how Clients may be untrusted (output policy 2), in which case to optimize the performance of Shrinkwrap in Section 5. they only are permitted a differentially private view over the Section 6 describes how we implement specific secure com- inputs.

Load more