SAQE: Practical Privacy-Preserving Approximate Query Processing for Data Federations
Johes Bater (Northwestern University), Yongjoo Park (University of Illinois (UIUC)), Xi He (University of Waterloo), Xiao Wang (Northwestern University), Jennie Rogers (Northwestern University)

ABSTRACT

A private data federation enables clients to query the union of data from multiple data providers without revealing any extra private information to the client or any other data providers. Unfortunately, this strong end-to-end privacy guarantee requires cryptographic protocols that incur a significant performance overhead, as high as 1,000× compared to executing the same query in the clear. As a result, private data federations are impractical for common database workloads. This gap reveals the following key challenge in a private data federation: offering significantly fast and accurate query answers without compromising strong end-to-end privacy.

To address this challenge, we propose SAQE, the Secure Approximate Query Evaluator, a private data federation system that scales to very large datasets by combining three techniques (differential privacy, secure computation, and approximate query processing) in a novel and principled way. First, SAQE adds novel secure sampling algorithms into the federation's query processing pipeline to speed up query workloads and to minimize the noise the system must inject into the query results to protect the privacy of the data. Second, we introduce a query planner that jointly optimizes the noise introduced by differential privacy with the sampling rates and resulting error bounds owing to approximate query processing.

Our research shows that these three techniques are synergistic: sampling within certain accuracy bounds improves both query privacy and performance, meaning that SAQE executes over less data than existing techniques without sacrificing efficiency, privacy, or accuracy. Using our optimizer, we leverage this counter-intuitive result to identify an inflection point that maximizes all three criteria prior to query evaluation. Experimentally, we show that this result enables SAQE to trade off among these three criteria and to scale its query processing to very large datasets, with accuracy bounds dependent only on the sample size, not the raw data size.

PVLDB Reference Format:
Johes Bater, Yongjoo Park, Xi He, Xiao Wang, Jennie Rogers. SAQE: Practical Privacy-Preserving Approximate Query Processing for Data Federations. PVLDB, 13(11): 2691-2705, 2020.
DOI: https://doi.org/10.14778/3407790.3407854

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 11
ISSN 2150-8097.

1. INTRODUCTION

Querying the union of multiple private data stores is challenging due to the need to compute over the combined datasets without data providers disclosing their secret query inputs to anyone. Here, a client issues a query against the union of these private records and he or she receives the output of their query over this shared data. Presently, systems of this kind use a trusted third party to securely query the union of multiple private datastores. For example, some large hospitals in the Chicago area offer services for a certain percentage of the residents; if we can query the union of these databases, it may serve as an invaluable resource for accurate diagnosis, informed immunization, timely epidemic control, and so on. Yet, their databases stay siloed by default; these hospitals (i.e., data providers) are reticent to share their data with one another, fearing that the sensitive health records of their patients may be breached in the process of careless data sharing. To address this, we need a principled approach to querying over multiple private datastores.

Private Data Federations. A private data federation [9, 58] is a database system that offers end-to-end privacy guarantees for querying the union of multiple datastores. In other words, a private data federation ensures that the secret inputs of each data provider are accessible only to him or her 1) before, 2) during, and 3) after query processing. Before query processing, the system makes all query optimization decisions in a data-independent manner; no one can learn anything about the data by examining the query's operators or parameters. During query processing, the system uses secure computation protocols that compute over the private data of multiple parties such that none of them learns the secret inputs of their peers; the only information that is revealed is that which can be deduced from the query output. Finally, after query processing, the system adds carefully calibrated noise to the query results using differential privacy [22], such that an adversary viewing this output will get essentially the same answer whether or not an individual chose to make his or her data available for querying. The system has an integrated suite of differential privacy mechanisms for modeling this controlled information leakage both during and after query processing.

Existing differential privacy systems [36, 46] with a single trusted data curator do not have to protect data during query processing. Conversely, a private data federation must add noise using secure computation [10] so that no party can deduce the true query answer, and any client that colludes with a data provider is unable to break the system's privacy guarantees, even under repeated querying of a single dataset. To uphold this strong privacy guarantee on the input data, private data federations must add noise to query results, reducing accuracy and thereby compromising their utility.

[Figure 1: Key components of SAQE query processing]
[Figure 2: Expected error (variance in tuple count) for COUNT query results in SAQE as a function of sampling rate, plotting sampling error, DP noise, summed error, and the maximum-accuracy point; n = 10,000, ε = 0.05, and δ = 10⁻⁵]

Secure Computation. We use secure computation to obliviously compute over the data of two or more data providers. An oblivious program's observable transcript (i.e., its program counter, memory access patterns, and network transmissions) leaks nothing about its input data because these signals are data-independent. Oblivious query processing incurs extremely high overhead owing to its use of heavyweight cryptographic protocols to encrypt and process input data. In practice, an oblivious query execution runs in worst-case time to uphold rigorous security guarantees. For example, a join with inputs of length n will have an output cardinality of n². Oblivious database operators produce worst-case output cardinalities by padding their results with dummy tuples. These cryptographic protocols also incur significant computation and network overhead to run with their data encrypted in flight. Queries over secure computation have runtimes that are multiple orders of magnitude slower than running the same query with no security. Hence, systems that use secure computation alone to protect data under computation [9, 58] do not scale to datasets larger than hundreds of megabytes.

Approximate Query Processing. When executing queries over extremely large data sets, big data systems speed up execution by using approximate query processing [3, 20, 35, 38, 49], evaluating a query over a sample of its input data rather than computing the

SQL querying over private data federations by generalizing approximate query processing to this setting. Unlike the naïve approach, SAQE generalizes approximate query processing to this setting by carefully composing it with differential privacy and secure computation, making their interactions synergistic through an integrated model of their performance profiles and information leakage. We illustrate SAQE's composition of these techniques in Figure 1.

First, SAQE ensures sampling-time privacy by introducing two private sampling algorithms. That is, the naïve approach (sampling locally at each data provider, then computing on all the samples using secure computation) reveals the exact size of each sample, and each data provider knows their precise contribution to the query results. A curious data provider could use this knowledge to deduce unauthorized information about private data from query results. Our private sampling algorithms prevent this as follows. In our first private sampling algorithm (i.e., an oblivious sampling algorithm), the data providers work together in secure computation to winnow down their input data to a set of samples for use in query evaluation without leaking any information. It outperforms a straw-man method by reducing the sampling cost from O(n log² n) to O(n log n), where n is the size of the input data. The second algorithm uses computational differential
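The introduction describes adding carefully calibrated noise to query results using differential privacy. For a COUNT query, the standard approach (not necessarily SAQE's exact mechanism, which runs inside secure computation) is the Laplace mechanism: a count has sensitivity 1, so noise drawn from Laplace(0, 1/ε) suffices for ε-differential privacy. A minimal plaintext sketch:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float) -> float:
    """epsilon-DP release of a COUNT: sensitivity of a count is 1,
    so the Laplace scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)
```

With ε = 0.05, as in Figure 2's setting, the noise scale is 1/0.05 = 20, which is why small-ε releases carry substantial noise variance (2/ε²).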
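The text notes a counter-intuitive synergy: sampling improves privacy as well as performance (the figure residue mentions "Privacy Amplification"). The standard amplification-by-subsampling bound, which may or may not match SAQE's exact accounting, says that running an ε-DP mechanism on a subsample drawn at rate q yields an effective privacy loss of ln(1 + q(e^ε − 1)) ≤ ε. A sketch:

```python
import math

def amplified_epsilon(base_epsilon: float, q: float) -> float:
    """Effective privacy loss after subsampling at rate q, per the
    standard amplification-by-subsampling bound:
        eps' = ln(1 + q * (e^eps - 1)).
    For q = 1 this recovers base_epsilon; for small q it is ~ q * eps."""
    return math.log(1.0 + q * (math.exp(base_epsilon) - 1.0))
```

This is why executing over less data can strengthen, rather than weaken, the privacy guarantee: the sampling step itself hides whether any individual's record entered the computation.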
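The abstract says SAQE's planner jointly optimizes DP noise against sampling rates and their error bounds, and Figure 2 plots sampling error, DP noise, and their sum against the sampling rate. The following toy model illustrates the shape of that joint optimization only; the error formulas (Bernoulli-sampled count estimate, Laplace noise with amplification-adjusted epsilon) are my assumptions, not SAQE's published cost model, and the true optimizer is richer.

```python
import math

def sampling_variance(n: int, q: float) -> float:
    """Variance of the scaled count estimate (sample count / q) under
    Bernoulli sampling at rate q; n is the number of qualifying tuples."""
    return n * (1.0 - q) / q

def base_epsilon(eps_target: float, q: float) -> float:
    """Largest per-sample epsilon whose subsampled (amplified) privacy
    loss still meets eps_target; inverts eps_target = ln(1 + q(e^e - 1))."""
    return math.log(1.0 + (math.exp(eps_target) - 1.0) / q)

def noise_variance(eps_target: float, q: float) -> float:
    """Variance of Laplace noise added to the sample count, scaled up
    by 1/q to the full-population estimate."""
    eps_b = base_epsilon(eps_target, q)
    return 2.0 / (eps_b ** 2 * q ** 2)

def best_sampling_rate(n=10_000, eps_target=0.05, grid=100) -> float:
    """Scan candidate sampling rates and return the one minimizing the
    summed error, mirroring the planner's pre-execution search."""
    rates = [(i + 1) / grid for i in range(grid)]
    return min(rates,
               key=lambda q: sampling_variance(n, q) + noise_variance(eps_target, q))
```

Under this simplified model the optimum can sit at the boundary; the point of the sketch is only that both error sources are closed-form functions of q, so the planner can pick the rate before query evaluation, data-independently.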
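The secure-computation discussion explains that oblivious operators always produce worst-case output cardinalities, padding a join's output to n² with dummy tuples so its size leaks nothing. A plaintext illustration of the padding idea only; a real oblivious operator would execute under MPC with the dummies cryptographically indistinguishable from real tuples:

```python
from itertools import product

DUMMY = None  # stand-in dummy tuple; real systems tag dummies under encryption

def padded_join(left, right, predicate):
    """Join two relations but always emit |left| * |right| output slots:
    real matches where the predicate holds, dummy tuples elsewhere.
    The output cardinality is therefore data-independent."""
    return [(l, r) if predicate(l, r) else DUMMY
            for l, r in product(left, right)]
```

This worst-case padding is exactly why oblivious query processing is so expensive, and why sampling the inputs first, as SAQE does, pays off: shrinking each input by a factor of q shrinks the padded join by q².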