Shrinkwrap: Efficient SQL Query Processing in Differentially Private Data Federations

Johes Bater (Northwestern University), Xi He (University of Waterloo), William Ehrich (Northwestern University), Ashwin Machanavajjhala (Duke University), Jennie Rogers (Northwestern University)

ABSTRACT

A private data federation is a set of autonomous databases that share a unified query interface offering in-situ evaluation of SQL queries over the union of the sensitive data of its members. Owing to privacy concerns, these systems do not have a trusted data collector that can see all their data, and their member databases cannot learn about individual records of other engines. Federations currently achieve this goal by evaluating queries obliviously using secure multiparty computation. This hides the intermediate result cardinality of each query operator by exhaustively padding it. With cascades of such operators, this padding accumulates to a blow-up in the output size of each operator and a proportional loss in query performance. Hence, existing private data federations do not scale well to complex SQL queries over large datasets.

We introduce Shrinkwrap, a private data federation that offers data owners a differentially private view of the data held by others to improve performance over oblivious query processing. Shrinkwrap uses computational differential privacy to minimize the padding of intermediate query results, achieving up to a 35X performance improvement over oblivious query processing. When the query needs differentially private output, Shrinkwrap provides a trade-off between result accuracy and query evaluation performance.

PVLDB Reference Format:
Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, Jennie Rogers. Shrinkwrap: Differentially-Private Query Processing in Private Data Federations. PVLDB, 12(3): 307-320, 2018.
DOI: https://doi.org/10.14778/3291264.3291274

1. INTRODUCTION

The storage and analysis of data has seen dramatic growth in recent years. Organizations have never valued data more highly. Unfortunately, this value has attracted unwanted attention. Data breaches litter the news headlines, creating fear and a hesitance to share data, even among trusted collaborators. Without data sharing, information becomes siloed and enormous potential analytical value is lost.

Recent work in databases and cryptography attempts to solve the data sharing problem by introducing the private data federation [4]. A private data federation consists of a set of data owners who support a common relational schema. Each party holds a horizontal partition (i.e., a subset of rows) of each table in the database. A private data federation provides a SQL query interface to analysts (clients) over the union of the records held by the data owners. Query evaluation is performed securely across multiple data owners without revealing unauthorized information to any party involved in the query and without the assistance of a trusted data curator.

A private data federation must provably ensure the following guarantee: a data owner should not be able to reconstruct the database (or a part of it) held by other data owners based on the intermediate result sizes of the query evaluation. One way to achieve this without a trusted data curator is to use secure computation protocols, which provide a strong privacy guarantee that intermediate results leak no information about their inputs.

Current implementations of private data federations that use secure computation have a performance problem: state-of-the-art systems have slowdowns of 4–6 orders of magnitude over non-private data federations [4, 56]. This slowdown comes from the secure computation protocols used by private data federations to securely evaluate SQL queries without the need of a trusted third party. A query is executed as a directed acyclic graph of database operators. Each operator takes one or two inputs, applies a function, and outputs its result to the next operator. In a typical (distributed) database engine, the execution time needed to compute each intermediate result and the size of that result leak information about the underlying data to curious data owners participating in the secure computation. To prevent this leakage, private data federations insert dummy tuples to pad intermediate results to their maximum possible size, thereby ensuring that execution time is independent of the input data. With secure evaluation, query execution always runs in the worst case, drastically increasing computation costs as intermediate result sizes grow. While performance is reasonable for simple queries and small workloads, it is untenable for complex SQL queries with multiple joins.

Several approaches attempt to solve this performance problem, but they fail to provably bound the information leaked to a data owner. One line of research uses Trusted Execution Environments (TEEs) that evaluate relational operators within an on-chip secure hardware enclave [54, 56]. TEEs efficiently protect query execution, but they require specialized hardware from chip manufacturers. Moreover, current TEE implementations from both Intel and AMD do not adequately obscure computation, allowing observers to obtain supposedly secure data through widely publicized hardware vulnerabilities [9, 17, 28, 31, 49]. Another approach selectively applies homomorphic encryption to evaluate relational operators in query trees, while keeping the underlying tuples encrypted throughout the computation (e.g., CryptDB [45]). While this system improves performance, it leaks too much information about the data, such as statistics and memory access patterns, allowing a curious observer to deduce information about the true values of encrypted tuples [15, 26, 39, 7].

In this work, we bridge the gap between provably secure systems with untenable performance and practical systems with no provable guarantees on leakage using differential privacy [14]. Differential privacy is a state-of-the-art technique to ensure privacy, and it provides a provable guarantee that one cannot reconstruct records in a database based on the outputs of a differentially private algorithm. Differentially private algorithms, nevertheless, permit approximate aggregate statistics about the dataset to be revealed.

We present Shrinkwrap, a system that improves private data federation performance by carefully relaxing the privacy guaranteed to data owners in terms of differential privacy. Instead of exhaustively padding intermediate results to their worst-case sizes, Shrinkwrap obliviously eliminates dummy tuples according to tunable privacy parameters, reducing each intermediate cardinality to a new, differentially private value. The differentially private intermediate result sizes are close to the true sizes, and thus Shrinkwrap achieves practical query performance.

To the best of our knowledge, Shrinkwrap is the first system for private data sharing that combines differential privacy with secure computation for query performance optimization. The main technical contributions in this work are:
• A query processing engine that offers controlled information leakage with differential privacy guarantees to speed up private data federation query processing and provides tunable privacy parameters to control the trade-off between privacy, accuracy, and performance
• A computational differential privacy mechanism that securely executes relational operators while minimizing intermediate result padding of operator outputs
• A novel algorithm that optimally allocates, tracks, and applies differential privacy across query execution
• A protocol-agnostic cost model that approximates the large, non-linear computation overhead of secure computation as it cascades up an operator tree

The rest of this paper is organized as follows. In Section 2 we define private data federations, outline our privacy goals, and formally define secure computation and differential privacy. Section 3 describes the problem addressed by Shrinkwrap. Our end-to-end solution, Shrinkwrap, and its privacy guarantees are described in Section 4. We show how to optimize the performance of Shrinkwrap in Section 5. Section 6 describes how we implement specific secure computation protocols on top of Shrinkwrap's protocol-agnostic design. We experimentally evaluate our system implementation over real-world medical data in Section 7. We conclude with a survey of related work and future directions.

2. PRIVATE DATA FEDERATION

In this section, we formally define a private data federation (PDF), describe privacy goals and assumptions, and define two security primitives – secure computation and differential privacy.

A private data federation is a collection of autonomous databases that share a unified query interface offering in-situ evaluation of SQL queries over the union of the sensitive data of its members without revealing unauthorized information to any party involved in the query. A private data federation has three types of parties: 1) two or more data owners DO1, ..., DOm that hold private data D1, ..., Dm respectively, where all Di share a public schema of k relations (R1, ..., Rk); 2) a query coordinator that plans and orchestrates SQL queries q on the data of the data owners; and 3) a client that writes SQL statements to the query coordinator. The set of private data D̄ = (D1, ..., Dm) owned by the data owners is a horizontal partition of every table in the total data set D.

[Figure 1: Private data federation architecture. The end user submits SQL query q to the query coordinator, which compiles a secure query plan for the private data owners; the data owners evaluate it with secure computation and return secret shares of q's output (one per data owner), which the coordinator assembles into the SQL output of q over all databases.]

As shown in Figure 1, the client first passes a SQL statement q(·) to the query coordinator, who compiles the query into an optimized secure query plan to be executed by each data owner. Once the data owners finish execution, they each pass their secret share of the output to the query coordinator, who assembles and returns the result to the client.

2.1 Privacy Goals and Assumptions

Table 1 summarizes the privacy goals for all parties involved in this process and the necessary assumptions required for achieving these goals.

Privacy Goals: Data owners require a privacy guarantee that the other parties (data owners, query coordinator, and the client) are not able to reconstruct the private data they hold based on intermediate results of the query execution. In particular, we require data owners to have a differentially private view over the inputs of other data owners (formally defined in Def. 2). We aim to support two policies for clients. Clients may be trusted (output policy 1) and allowed to see true answers to queries, but they must not learn any other information about the private inputs held by data owners. Clients may be untrusted (output policy 2), in which case they are only permitted a differentially private view over the inputs. The query coordinator is an extension of the client and has the same privacy requirements.

Table 1: Privacy Goals and Assumptions

                                Client Sees True Answers        Client Sees Noisy Answers
Privacy Goal for DOs            Differentially private (DP) view over data held by other DOs
Privacy Goal for Client         Learns only the query answer    DP view over data held by DOs
Client & DO Assumption          Semi-honest & computationally bounded
DOs Collude with Client         Not allowed                     Allowed
Client Colludes with some DOs   Not allowed                     Allowed

Assumptions: To achieve the privacy goals presented above, a private data federation assumes that all parties are computationally bounded and work in the semi-honest setting. Hence, the query coordinator honestly follows the protocols and creates a secure plan to evaluate the query. All parties faithfully execute the cryptographic protocol created by the query coordinator, but they may attempt to infer properties of private inputs held by other parties by observing the query instruction traces and data access patterns. The query coordinator is also assumed to be memory-less, as it destroys its contents after sending query results back to the client. When the client sees the true answers (output policy 1), we assume that the client is trusted not to collude with data owners; otherwise, if the client shares the true answer with a data owner (or publishes the true answer), the data owner who colludes with the client can use their own private data and the true query answer to infer the input of the other data owners. Similarly, we assume the data owners do not share their true input with the client; otherwise, the client can infer the other data owners' input and gain more information than just the query answer. When the client is not trusted, we assume that the client may collude with (as many as all but one) data owners. Our guarantees hold even in that extreme case.

2.2 Security Primitives

Secure Computation: To securely compute functions on distributed data owned by multiple parties, secure computation primitives are required. Let f : D^m → R^m be a functionality over D̄ = (D1, ..., Dm) ∈ D^m, where f_i(D̄) denotes the i-th element of f(D̄). (Functionality f can be either deterministic or randomized.) Let Π be an m-party protocol for computing f, and let VIEW_I^Π(D̄) be the view of the i-th party during an execution of Π on D̄. For I = {i_1, ..., i_t} ⊆ [m], we let D̄_I = (D_{i_1}, ..., D_{i_t}), f_I(D̄) = (f_{i_1}(D̄), ..., f_{i_t}(D̄)), and VIEW_I^Π(D̄) = (I, VIEW_{i_1}^Π(D̄), ..., VIEW_{i_t}^Π(D̄)). We define secure computation with respect to semi-honest behavior [19] as follows.

Definition 1 (Secure Computation). We say an m-party protocol Π securely computes f : D^m → R^m if there exists a probabilistic polynomial-time algorithm S and a function negl(κ) such that for any non-uniform probabilistic polynomial adversary A, every I ⊆ [m], every polynomial p(·), every sufficiently large κ ∈ N, and every D̄ ∈ D^m of size at most p(κ), it holds that

| Pr[A(S(I, D̄_I, f_I(D̄)), f(D̄)) = 1] − Pr[A(VIEW_I^Π(D̄), OUTPUT^Π(D̄)) = 1] | ≤ negl(κ),    (1)

where negl(κ) refers to any function that is o(κ^{-c}) for all constants c, and OUTPUT^Π(D̄) denotes the output sequence of all parties during the execution represented in VIEW^Π(D̄).

Various cryptographic protocols that enable secure computation have been studied [5, 10, 11, 18, 46, 33, 52, 55], and several are applicable to relational operators [4, 8, 50].

Differential Privacy (DP): Differential privacy [14] is an appealing choice to bound the information leakage on individual records in the input while allowing multiple releases of statistics about the data. This privacy notion is used by numerous organizations, including the US Census Bureau [34], Google [16], and Uber [27]. Formally, differential privacy is defined as follows.

Definition 2 ((ε, δ)-Differential Privacy). A randomized mechanism f : D → R satisfies (ε, δ)-differential privacy if for any pair of neighboring databases D, D′ ∈ D such that D and D′ differ by adding or removing a row, and any set O ⊆ R,

Pr[f(D) ∈ O] ≤ e^ε · Pr[f(D′) ∈ O] + δ.    (2)

The differential privacy guarantee degrades gracefully when invoked multiple times. In the simplest form, the overall privacy loss of multiple differentially private mechanisms can be bounded with sequential composition [13].

Theorem 1 (Sequential Composition). If M1 and M2 are (ε1, δ1)-DP and (ε2, δ2)-DP algorithms that use independent randomness, then releasing the outputs M1(D) and M2(D) on database D satisfies (ε1 + ε2, δ1 + δ2)-DP.

There exist advanced composition theorems that give tighter bounds on privacy loss under certain conditions [14], but we use sequential composition as defined above because it is general and easy to compute.

3. PROBLEM STATEMENT

Our goal is to build a system that permits efficient query evaluation on private data federations while achieving provable guarantees on the information disclosed to clients and data owners. As discussed in Section 1, prior work is divided into two classes: (i) systems that use secure computation to answer queries, which give formal privacy guarantees but suffer 4–6 orders of magnitude slowdowns compared to non-private data federations [4, 56], and (ii) systems that use secure hardware [54, 56] or property-preserving encryption [45], which are practical in terms of performance but have no formal guarantee on the privacy afforded to clients and data owners.

While secure computation based solutions are attractive, and there is much recent work on improving their efficiency [6, 24, 25], secure computation fundamentally limits performance due to several factors: 1) a worst-case running time is necessary to avoid leaking information from early termination; 2) computation must be oblivious, i.e., the program counters and the memory traces must be independent of the input; 3) cryptographic keys must be sent and computed on by data owners, the cost of which scales with the complexity of the program.

In particular, applying secure computation protocols to execute SQL queries results in extremely slow performance. A SQL query is a directed acyclic graph of database operators. Each operator in the tree takes an input, applies a function, and outputs its result to the next operator. The execution time needed to calculate each intermediate result and the array size required to hold that result leak information about the underlying data. To address this leakage, private data federations must insert dummy tuples to pad intermediate results to their maximum possible size and ensure that execution time is independent of the input data. We demonstrate this property in the following example.

[Figure 2: Exhaustive padding of intermediate results in an oblivious query evaluation. The example query SELECT DISTINCT pid FROM demographic de, diagnosis di, medication m WHERE di.diag = 'hd' AND m.med = 'aspirin' AND di.code = m.code AND de.pid = m.pid is executed as an operator tree: filters σ_diag=hd and σ_med=aspirin over diagnosis [n] and medication [n] feed ⋈_code [n²], which joins with demographic [n] at ⋈_pid [n³], followed by DISTINCT [n³].]

Example 1. Figure 2 shows a SQL query and its corresponding operator tree. Tuples flow from the source relations through a filter operator, before entering two successive join operators and ending at a distinct operator. To hide the selectivity of each operator, the private data federation pads each intermediate result to its maximum value. At the filters, the results are padded as if no tuples were filtered out, remaining at size n. This is a significant source of additional computation. If the selectivity of a filter is 10⁻³, our padding gives a 1000× performance overhead. As the tuples flow into the first join, which now receives two inputs of size n, exhaustive padding produces an output of n² tuples. When this result passes through the next join, the result size that the distinct operator must process jumps to n³. If we once again assume a 10⁻³ selectivity at the joins, we now see a 10⁹× computation overhead. Such a significant slowdown arises from the cumulative effect of protecting cascading operators.

By exhaustively padding operator outputs, we see ever-increasing intermediate result cardinalities and, as a function of that growth, ever-decreasing performance.
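To make the cascade in Example 1 concrete, the short Python sketch below tabulates the exhaustively padded sizes against the overhead factors quoted in the example. It is a back-of-the-envelope illustration only: the base-table size n, the single 10⁻³ selectivity per filter and join, and the function name are our assumptions, not part of Shrinkwrap.

def example1_blowup(n: int, selectivity: float = 1e-3):
    # Exhaustively padded intermediate sizes for the Example 1 operator tree.
    padded = {
        "filter": n,          # filters padded as if nothing was removed
        "first_join": n ** 2, # two size-n inputs -> n^2 padded output
        "second_join": n ** 3 # feeds the DISTINCT operator
    }
    filter_overhead = 1.0 / selectivity           # ~1,000x at a filter
    cascade_overhead = (1.0 / selectivity) ** 3   # ~10^9x by the DISTINCT
    return padded, filter_overhead, cascade_overhead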
To tackle this fundamental performance challenge, we formalize our research problems as follows.
• Build a system that uses differentially private leakage of intermediate results to speed up SQL query evaluation while achieving all the privacy goals stated in Section 2.
• Given an SQL operator o and a privacy budget (ε, δ), design an (ε, δ)-differentially private mechanism that executes the operator with a smaller performance overhead compared to fully oblivious computation.
• Given a sequence of operators Q = {o1, . . . , ol}, design an algorithm to optimally split the privacy budget (ε, δ) among these operators during query execution.

In the following sections, we present our system for efficient SQL query processing, starting with the differentially private mechanisms and overall privacy analysis used for operator evaluation (Section 4), and then our budget splitting algorithm that optimizes performance (Section 5).

4. SHRINKWRAP

In this section, we present the end-to-end algorithm of Shrinkwrap for answering a SQL query in a private data federation, the key differentially private mechanism for safely revealing the intermediate cardinality of each operator, and the overall privacy analysis of Shrinkwrap.

4.1 End-to-end Algorithm

At its core, Shrinkwrap is a system that applies differential privacy throughout SQL query execution to reduce intermediate result sizes and improve performance. The private data federation ingests a query q as a directed acyclic graph (DAG) of database operators Q = {o1, . . . , ol}, following the order of a bottom-up tree traversal. The query coordinator decides the output policy based on the client type and assigns the privacy parameters (ε, δ) for the corresponding privacy goals. To improve performance, a privacy budget of (ε1→l, δ1→l) will be spent on protecting the intermediate cardinalities of the l operators. When the client is allowed to learn the true query output (output policy 1), then ε1→l = ε and δ1→l = δ; when the client is allowed to learn only noisy query answers from a differentially private mechanism, then ε1→l < ε and δ1→l < δ, and the remaining privacy budget (ε − ε1→l, δ − δ1→l) will be spent on the query output.

Algorithm 1 outlines the end-to-end execution of a SQL query in Shrinkwrap. Inputs to the algorithm are the query Q, the overall privacy parameters (ε, δ), the privacy budget for improving performance (ε1→l, δ1→l), and the public information K, including the relational database schema and the maximum possible input data size at each party. First, the query coordinator allocates the privacy budget for improving performance (ε1→l, δ1→l) among the operators in Q such that

Σ_{i=1}^{l} εi = ε1→l  and  Σ_{i=1}^{l} δi = δ1→l,    (3)

where (εi, δi) is the privacy budget allocated to operator oi. If εi = δi = 0, then the operator is evaluated obliviously (without revealing the intermediate result size). When εi, δi > 0, Shrinkwrap is allowed to reveal an overestimate of the intermediate result size, with larger (εi, δi) values giving tighter intermediate result estimates. The performance improvement of Shrinkwrap highly depends on how the budget is split among the operators, but any budget split that satisfies Equation 3 satisfies privacy. We show several allocation strategies in Section 5 and analyze them in the evaluation.

Next, the query coordinator compiles a secure query plan to be executed by the data owners. The secure query plan traverses the operators in Q. For each operator oi, the data owners compute secret shares of the true output using secure computation over the federated database and the outputs of other operators computed from D. We denote this secure computation for oi by SMC(oi, D). The true outputs (of size ci) are placed into a secure array Oi. The secure array is padded (with encrypted dummy values) so that its size equals the maximum possible output size for operator oi. Then, Shrinkwrap calls an (εi, δi)-differentially private function Resize() to resize this secure array Oi to a smaller array Si such that a subset of the (encrypted) dummy records are removed. Resizing is described in Section 4.2.

Once all the operators are securely evaluated, if the output policy allows release of true answers to the client (policy 1), the last secure array Sl is directly returned to the query coordinator; otherwise, the data owners jointly and securely compute a differentially private mechanism on Sl with the remaining privacy budget (ε′ = ε − ε1→l, δ′ = δ − δ1→l), denoted by M^(ε′,δ′)(Sl), and return the output of M^(ε′,δ′)(Sl) to the query coordinator. For instance, one could use a Laplace mechanism to perturb count queries, using a multiparty protocol for generating noise as in [38]. Finally, the query coordinator assembles the final secure array received from the data owners, decrypting the final result R and returning it to the client.

Algorithm 1: End-to-end Shrinkwrap algorithm
  Input: PDF query DAG with operators Q = {o1, . . . , ol}, the federated database D, public information K, privacy parameters (ε, δ), privacy budget for performance (ε1→l ≤ ε, δ1→l ≤ δ)
  Output: Query output R
  // Computed on the query coordinator
  P = {(ε1, δ1), . . . , (εl, δl) | Σ εi = ε1→l, Σ δi = δ1→l} ← AssignBudget(ε1→l, δ1→l, Q, K)   // (Sec. 5)
  // Computed on the private data owners
  for i ← 1 to l do
      (Oi, ci) ← SMC(oi, D)              // with exhaustive padding
      Si ← Resize(Oi, ci, εi, δi, ∆ci)   // DP resizing (Sec. 4.2)
  end
  if ε = ε1→l and δ = δ1→l then
      Send the query coordinator Sl                     // output policy 1
  else
      Query output budget ε′ ← ε − ε1→l, δ′ ← δ − δ1→l
      Send the query coordinator M^(ε′,δ′)(Sl)          // output policy 2
  end
  // Computed on the query coordinator
  R ← Assemble(Sl)
  return R to the client

  function Resize(O, c, ε, δ, ∆c)
      c̃ ← c + TLap(ε, δ, ∆c)   // perturb cardinality
      O ← Sort(O)               // obliviously sort dummy tuples to the end
      S ← new SecureArray(O[1, . . . , c̃])
      return S

4.2 DP Resizing Mechanism

For each operator oi, Shrinkwrap first computes the true result using secure computation and places it into an exhaustively padded secure array Oi. For example, as shown in Figure 3, a join operator with two inputs, each of size n, will have a secure array Oi of size n². This is the exhaustive padding case shown on the left side of the figure, with the private data federation inserting the materialized intermediate results into the secure array and padding those results with dummy tuples. The Resize() function takes in the exhaustively padded secure array Oi, the true cardinality of the output ci, the privacy parameters (εi, δi), and the sensitivity of the cardinality query at operator oi, denoted by ∆ci. The cardinality query ci(D) returns the output size of operator oi in the PDF query q. The sensitivity of a query is defined as follows.

[Figure 3: Effect of Shrinkwrap on intermediate result sizes when joining tables R and S. Without Shrinkwrap, the secure array is exhaustively padded to n² = |R| · |S| (real data plus dummy data); with Shrinkwrap, it holds the true cardinality |R ⋈ S| plus a differentially private amount of dummy data determined by Lap(ε, δ, ∆c) noise.]

Definition 3 (Query Sensitivity). The sensitivity of a function f : D → R is the maximum difference in its output for any pair of neighboring databases D and D′:

∆f = max_{D, D′ s.t. |(D−D′)∪(D′−D)| = 1} ||f(D) − f(D′)||₁.

Based on these inputs, the Resize() function runs an (εi, δi)-differentially private truncated Laplace mechanism (Def. 4) to obtain a new, differentially private cardinality c̃i > ci for the secure array. Next, this function sorts the exhaustively padded secure array Oi obliviously, such that all dummy tuples are at the end of the array, and then copies the first c̃i tuples of the sorted array into a new secure array Si of size c̃i. This new secure array Si, of smaller size than Oi, is used for the following computations.

Noise Generation: Shrinkwrap generates differentially private cardinalities using a truncated Laplace mechanism. Given an operator o with a privacy budget (ε, δ) and the sensitivity parameter ∆c of its cardinality query, this mechanism computes a noisy cardinality for use in Algorithm 1.

Definition 4 (Truncated Laplace Mechanism). Given a query c : D → N, the truncated Laplace mechanism TLap(ε, δ, ∆c) adds a non-negative integer noise max(η, 0) to the true query answer c(D), where η ∈ Z follows a distribution, denoted by L(ε, δ, ∆c), with probability mass function Pr[η = x] = p · e^{−(ε/∆c)·|x − η₀|}, where p = (e^{ε/∆c} − 1)/(e^{ε/∆c} + 1) and η₀ = −∆c · ln((e^{ε/∆c} + 1)·δ)/ε + ∆c.

This mechanism allows us to add a non-negative integer amount of noise to the intermediate cardinality query, which corresponds to the number of dummy records. The noise η drawn from L(ε, δ, ∆c) satisfies Pr[η < ∆c] ≤ δ, which enables us to show the privacy guarantee of this mechanism.
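For concreteness, the following Python sketch samples the noise of Definition 4 and produces the resized cardinality c̃ = c + max(η, 0). It is illustrative only: rounding the shift η₀ to the nearest integer and drawing the discrete Laplace noise as a difference of two geometric variables are our simplifications, and the function names are hypothetical rather than Shrinkwrap's API.

import math
import random

def truncated_laplace_noise(eps: float, delta: float, sensitivity: int) -> int:
    # Shift eta_0 from Definition 4; it ensures Pr[noise < sensitivity] <= delta.
    alpha = eps / sensitivity
    eta0 = -sensitivity * math.log((math.exp(alpha) + 1.0) * delta) / eps + sensitivity
    # Discrete Laplace with decay e^{-alpha}, sampled as a difference of geometrics.
    q = math.exp(-alpha)
    def geometric() -> int:
        return int(math.floor(math.log(1.0 - random.random()) / math.log(q)))
    eta = round(eta0) + geometric() - geometric()
    return max(eta, 0)

def noisy_cardinality(true_card: int, eps: float, delta: float, sensitivity: int) -> int:
    # The value used to size the smaller secure array S_i in Resize().
    return true_card + truncated_laplace_noise(eps, delta, sensitivity)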

Theorem 2. The truncated Laplace mechanism satisfies (ε, δ)-differential privacy.

Proof. For any neighboring databases D1, D2, let c1 = c(D1) ≥ 0 and c2 = c(D2) ≥ 0. W.l.o.g. we consider c2 ≥ c1. It is easy to see that Pr[TLap(D1) ∈ (−∞, c1)] = 0 and Pr[TLap(D2) ∈ (−∞, c2)] = 0. By the noise property, for any o ≥ c2 ≥ c1 it is true that | ln( Pr[TLap(D2) = o] / Pr[TLap(D1) = o] ) | ≤ ε. However, when the output o ∈ [c1, c2), Pr[TLap(D2) = o] = 0 but Pr[TLap(D1) = o] > 0, making the ratio of probabilities unbounded. Nevertheless, we can bound Pr[TLap(D1) ∈ [c1, c2)] by δ as shown below. Let O* = (−∞, c2). We can show that for any output set O of the query c(·), we have

Pr[TLap(D1) ∈ O]
  = Pr[TLap(D1) ∈ (O ∩ O*)] + Pr[TLap(D1) ∈ (O − O*)]
  ≤ Pr[TLap(D1) ∈ [c1, c2)] + e^ε · Pr[TLap(D2) ∈ (O − O*)]
  ≤ Pr[η1 < ∆c] + e^ε · Pr[TLap(D2) ∈ O]
  ≤ δ + e^ε · Pr[TLap(D2) ∈ O].    (4)

Hence, this mechanism satisfies (ε, δ)-DP.

Sensitivity Computation: In order to create our noisy cardinalities using the truncated Laplace mechanism in Definition 4, we must determine the cardinality query sensitivity at the output of each operator in the query tree. An intermediate result cardinality, ci, is the output size of the sub-query tree rooted at operator oi. The sensitivity of ci depends on the operators in its sub-query tree, and we compute it as a function of the stability of the operators in the sub-query tree. The stability of an operator bounds the change in its output with respect to a change in its input.

Definition 5 (Stability [36]). A transformation operator T is s-stable if for any two input data sets A and B, |T(A) ⊖ T(B)| ≤ s × |A ⊖ B|, where ⊖ denotes the symmetric difference.

For unary operators, such as SELECT and PROJECT, the stability usually equals one. For JOIN, a binary operator, the stability is equal to the maximum multiplicity of the input data. The sensitivity of the intermediate cardinality query ci can be recursively computed bottom up, starting from a single change in the input of the leaf nodes.

Example 2. Given the query tree shown in Figure 4, the bottom two filter operators each have a stability of one, giving a sensitivity (∆c) of 1 at that point. Next, the first join has a stability of m, the maximum input multiplicity, which increases the overall sensitivity to m. The following join operator also has a stability of m, which raises the sensitivity to m². Finally, the DISTINCT operator has a stability of 1, so the overall sensitivity remains m².

[Figure 4: Sensitivity analysis for the running example. Filters σ_diag=hd over diagnosis [∆c1 = 1] and σ_med=aspirin over medication [∆c2 = 1] feed ⋈_pid [∆c3 = m], which joins with demographics at ⋈_pid [∆c4 = m²], followed by DISTINCT [∆c5 = m²].]

Using these principles, Shrinkwrap calculates the sensitivity for each operator in a query tree for use in Laplace noise generation during query execution. More details on computing the sensitivity of SQL queries can be found in PINQ [36].

Secure Array Operations: Each time Shrinkwrap adds or removes tuples, it normally needs to pay an I/O cost for accessing the secure array. We avoid paying the full cost by carrying out a bulk unload and load. To start the Resize() function, Shrinkwrap takes an intermediate result secure array and sorts it so that all dummy tuples are at the end of the secure array. Next, Shrinkwrap cuts off the end of the original secure array and copies the rest into the new, differentially-private-sized secure array. Since the new secure array has a size guaranteed to be larger than the true cardinality, we know that cutting off the end will only remove dummy tuples. Thus, we avoid paying the full I/O cost for each tuple. However, we do pay an O(n log n) cost for the initial sorting, as well as an O(n) cost for bulk copying the tuples. This algorithm is an extension of the one in [21], and variants of it appear in [40, 41, 56].

4.3 Overall Privacy Analysis

Given a PDF query DAG q with operators {o1, . . . , ol} and a privacy parameter (ε, δ), Algorithm 1 achieves the privacy goals stated in Section 2.

Theorem 3. Under the assumptions in Section 2, Algorithm 1 achieves the privacy goals: data owners have an (ε, δ)-computational differentially private view over the input data of other data owners; when true answers are returned to the client, the client learns nothing else; when noisy answers are returned to the client, the client has an (ε, δ)-computational differentially private view over the input data of all the data owners.

Proof. (Sketch) We prove that the view of data owners satisfies computational differential privacy [37, 48] by showing that the view of any data owner is computationally indistinguishable from its view in an ideal simulation that outputs only the differentially private cardinalities of each operator output.

Let Sim be a simulator that takes as input the horizontal partitions held by the data owners (D1, ..., Dm) and outputs a vector S⊥i for each operator oi such that: (a) |S⊥i| is drawn from ci + TLap(εi, δi, ∆ci), and (b) S⊥i contains encryptions of 0 (of the appropriate arity). It is easy to see that the function computed by Sim satisfies (ε1→l, δ1→l)-DP: (a) at each operator, the release of |S⊥i| satisfies (εi, δi)-DP, and (b) by sequential composition (Theorem 1) and Equation 3, releasing all the cardinalities satisfies (ε1→l, δ1→l)-DP. When the output itself is differentially private, there is an additional privacy loss to data owners (since the client could release the output to the data owners), but the overall algorithm still satisfies (ε, δ)-DP by sequential composition.

The real execution outputs secure arrays Si at each operator oi using secure computation. This ensures that the view of a data owner in the real execution is computationally indistinguishable from the execution of Sim, which satisfies (ε, δ)-DP. Thus, the view of any data owner satisfies (ε, δ)-computational DP. The privacy guaranteed to the client and query coordinator can be argued analogously.

4.4 Multiple SQL Queries

The privacy loss of Shrinkwrap due to answering multiple queries is analyzed using sequential composition (Theorem 1). There are two approaches to minimize privacy loss across multiple queries: (i) applying advanced composition theorems that give tighter bounds on privacy loss under certain conditions [14], or (ii) optimizing privacy budget allocation across the operators of a workload of predefined SQL queries (e.g., using [35]). These two approaches are orthogonal to our work, and adapting these ideas to the context of secure computation is an interesting avenue for future work.
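To make the bottom-up sensitivity computation of Section 4.2 (Definition 5 and Example 2) concrete, the sketch below multiplies operator stabilities along the path from a changed leaf. It is a simplified illustration: the operator names, the list-based plan encoding, and the assumption that the maximum join multiplicity m is known are ours, not part of Shrinkwrap's interface.

def stability(op: str, max_multiplicity: int = 1) -> int:
    # Unary operators (SELECT, PROJECT, DISTINCT) are 1-stable; JOIN is
    # m-stable, where m is the maximum multiplicity of its input data.
    return max_multiplicity if op == "JOIN" else 1

def cardinality_sensitivity(path) -> int:
    # Multiply stabilities from the leaf up to the operator of interest,
    # starting from a single-row change at the leaf (Example 2).
    s = 1
    for op, m in path:
        s *= stability(op, m)
    return s

# Example 2: filter (1) -> join (m) -> join (m) -> distinct (1) gives m^2.
m = 4  # hypothetical maximum multiplicity
assert cardinality_sensitivity([("FILTER", 1), ("JOIN", m), ("JOIN", m), ("DISTINCT", 1)]) == m * m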

5. PERFORMANCE OPTIMIZATION

A key technical challenge is to divide the privacy budget for performance (ε1→l, δ1→l) across the different operators so that the overall runtime of the query execution is minimized.

Problem 1. Given a privacy budget (ε1→l, δ1→l), a PDF query DAG q with operators Q = {o1, . . . , ol}, and public information K, which includes the sensitivities of the intermediate cardinality queries {∆c1, . . . , ∆cl} and the schema information, compute the share (εi, δi) assigned to oi such that: (a) the assignment ensures privacy, i.e., Σ_{i=1}^{l} εi = ε1→l, and (b) the overall runtime of the query is minimized.

5.1 Baseline Privacy Budget Allocation

We first consider two baseline strategies for allocating the privacy budget to individual operators.

Eager: This approach allocates the entire budget (ε1→l, δ1→l) to the first operator in the query tree. In Figure 2, this equates to minimizing the output cardinality of the filter on diagnoses and executing the rest of the tree obliviously. A benefit of this approach is that since the intermediate results of the first operator flow through the rest of the tree, every operator will see the benefit of minimizing the output cardinality. On the other hand, in some trees, certain operators, like joins, have an outsized effect on the total query performance. Not allocating any privacy budget to those operators results in a missed optimization opportunity.

Uniform: The second approach splits the privacy budget (ε1→l, δ1→l) uniformly across the query tree, resulting in equal privacy parameters for each operator. The benefit here is that every single operator receives part of the budget, so we do not lose out on performance gains from any operator. The drawback of the uniform approach, like the eager approach, is that some operators produce larger effects, so by not allocating more of the budget to them, we lose out on potential performance gains.

5.2 Optimal Privacy Budget Allocation

We design a third approach that uses an execution cost model as an objective function and applies convex optimization to determine the optimal budget splitting strategy.

Cost Model: The execution cost of a query in Algorithm 1 mainly consists of (a) the cost of securely computing each operator and (b) the I/O cost of Shrinkwrap handling each intermediate result. We model these costs as functions of the corresponding input and output data sizes. The cost of secure computation at operator oi is represented by a function cost_oi(Ni), where Ni is the set of the cardinalities of the input tables. The I/O cost of Shrinkwrap handling the intermediate result of operator oi is mainly the cost of sorting the exhaustively padded array of size n and the cost of copying that array into the new array of size n′, denoted by cost_s(n) and cost_c(n, n′) respectively. In particular, the size of the exhaustively padded array depends on the sizes of the input tables, represented by a function oi(Ni). Most of the time, the size of the exhaustively padded array is the product of the input table sizes. For example, if operator oi takes one input table of size n1, then oi(N = {n1}) = n1; if operator oi takes two input tables of sizes n1 and n2, then oi(N = {n1, n2}) = n1 · n2. The number of tuples n′ copied to the new array depends on the noisy output size ni.

The input tables to an operator can be a mix of base tables and intermediate output tables. Given a database of k relations (R1, . . . , Rk), there will be an additional l intermediate output tables generated in a query tree of l operators. The public knowledge K contains the table sizes of the original tables, but the output cardinalities of the operators are unknown. In the private setting, we cannot use the true cardinality for the estimation. Instead, Shrinkwrap uses the naive reduction factors from [47] to estimate the output cardinality of each operator, denoted by estimate(ci, K). The noise ηi added to the cardinality depends on the noise distribution of the truncated Laplace mechanism, and we use the expectation of the distribution TLap(εi, δi, ∆ci) to model the noise. Putting this together, the output cardinality ni at operator oi is estimated by estimate(ci, K) + E(TLap(εi, δi, ∆ci)).

With this formulation, the cost of assigning privacy budget P = {(εi, δi) | i = 1, . . . , l} to operators Q = {o1, . . . , ol} in the query tree is modeled as

C(P, K) = Σ_{i=1}^{l} [ cost_oi(Ni) + cost_s(oi(Ni)) + cost_c(oi(Ni), ni) ],    (5)

where ni = estimate(ci, K) + E(TLap(εi, δi, ∆ci)).

Optimization: Using this cost model, we find the optimal solution to the following problem and use it as the budget allocation strategy for Algorithm 1:

min_P C(P, K)  s.t.  Σ_{i=1}^{l} εi ≤ ε1→l,  Σ_{i=1}^{l} δi ≤ δ1→l,  εi, δi ≥ 0  ∀i = 1, . . . , l.    (6)

We show the detailed cost model, including the unit cost functions cost_oi(Ni), cost_s(oi(Ni)), and cost_c(oi(Ni), ni), for two different secure computation protocols in the next section. Using a convex optimization solver, we determine the optimal budget split that minimizes the cost objective function. We evaluate how it performs against the baseline approaches in Section 7.
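As a rough illustration of the optimization in Equation 6, the Python sketch below feeds a user-supplied cost function (standing in for Equation 5) to an off-the-shelf solver, with equality constraints mirroring Equation 3. This is a simplified stand-in for Shrinkwrap's convex optimization step: cost_fn, its packed-vector arguments, and the SLSQP choice are assumptions for illustration, not the system's implementation.

import numpy as np
from scipy.optimize import minimize

def optimal_budget_split(cost_fn, l, eps_total, delta_total):
    # Decision variables: per-operator (eps_i, delta_i), packed as one vector.
    x0 = np.concatenate([np.full(l, eps_total / l), np.full(l, delta_total / l)])
    objective = lambda x: cost_fn(x[:l], x[l:])  # models C(P, K), Eq. 5
    constraints = [
        {"type": "eq", "fun": lambda x: np.sum(x[:l]) - eps_total},    # sum eps_i  = eps_{1->l}
        {"type": "eq", "fun": lambda x: np.sum(x[l:]) - delta_total},  # sum delta_i = delta_{1->l}
    ]
    bounds = [(0.0, None)] * (2 * l)
    result = minimize(objective, x0, method="SLSQP", bounds=bounds, constraints=constraints)
    return list(zip(result.x[:l], result.x[l:]))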

6. PROTOCOL IMPLEMENTATION

Shrinkwrap is a flexible computation layer that supports a wide range of private data federations and database engines. The only requirements are that the private data federation process queries as operator trees and that the underlying database engine support SQL queries. Shrinkwrap implements a wide range of SQL operators for PDF queries, including selection, projection, aggregation, limit, and some window aggregates. For joins, we handle equi-joins and cross products. At this time we do not support outer joins or set operations. For output policy 2, where the client sees noisy answers released from differentially private mechanisms, we support aggregate queries like COUNT for the final operator.

The fundamental design pattern of secure computation protocols is the circuit model, where functions are represented as circuits, i.e., collections of topologically ordered gates. The benefit of this model is that by describing how to securely compute a single gate, we can compose the computation across all gates to evaluate any circuit, simplifying the design of the protocol. The downside is that circuits are difficult for programmers to reason about. Instead, the von Neumann-style Random Access Machine (RAM) model handles the impedance mismatch between programmers and circuits by creating data structures, such as ORAM, that utilize circuit-based oblivious algorithms for I/O. With the RAM model, programmers no longer have to write programs as circuits. Instead, they can think of their data as residing in secure arrays that can be plugged into their existing code. The downside of the RAM model is that since the data structures are more general purpose, performance lags behind direct circuit-model implementations of the same function. In this work, we use Shrinkwrap with both RAM model and circuit model protocols, providing a general framework that can easily add new protocols to the query executor.

6.1 RAM Model

When integrating a new protocol, we first create an operator cost model that captures the execution cost of each operator. For a RAM-based protocol, we model the cost of secure computation cost_o(N) for commonly used relational operators in Table 2, where N is the set of cardinalities of the input tables of operator o. The cost function considered in this work focuses on the I/O cost of each relational operator in terms of reads and writes from a secure array.

Table 2: Operator I/O cost

Operator             | cost_o(N = {n1, n2, . . .})
Filter               | n1 · c_read(n1) + n1 · c_write(n1)
Join                 | n1 · c_read(n1) + n1 · n2 · c_read(n2) + n1 · n2 · c_write(n1 · n2)
Aggregate            | n1 · c_read(n1) + c_write(n1)
Aggregate, Group By  | n1 · c_read(n1) + n1 · c_write(n1)
Window Aggregate     | n1 · c_read(n1) + n1 · c_write(n1)
Distinct             | n1 · c_read(n1) + n1 · c_write(n1)
Sort                 | n1 · log²(n1) · (c_read(n1) + c_write(n1))

We let c_read(n) and c_write(n) be the unit costs for reading and writing a tuple from a secure array of size n, respectively. Secure arrays protect their contents from attackers by guaranteeing that all reads or writes access the same number of memory locations. Typically, the data structures will shuffle either their entire contents or their access path on each access, with different implementations providing performance ranging from O(log n) to O(n log²(n)) per read or write [2, 53]. This variable I/O cost guarantees that an observer cannot look at memory access time or program traces to infer sensitive information.

Given the unit costs for reading and writing, we can represent the cost of secure computation for each operator as the sum of its read and write costs. For example, the cost function for a Filter operator, cost_Filter(N = {n1}), is the cost of reading n1 input records and writing n1 output records. For a Join operator over two input tables, the cost function cost_Join(N = {n1, n2}) equals the sum of n1 record reads from the first secure array of size n1, (n1 · n2) record reads from the second array of size n2, and (n1 · n2) record writes, i.e., n1 · c_read(n1) + n1 · n2 · c_read(n2) + n1 · n2 · c_write(n1 · n2). The other operators are modeled similarly in Table 2. The cost function for copying from an array of size n to an array of size n′ can be modeled as cost_c(n, n′) = n · c_read(n) + n′ · c_write(n′).

In this work, we use ObliVM [32] to implement our RAM-based relational operators, under a SMCQL [4] private data federation design. We chose ObliVM due to its compiler, which allows SMCQL to easily convert operators written as C-style code into secure computation programs. However, Shrinkwrap can support more recent ORAM implementations such as DORAM [12] by summarizing their cost behavior as in Table 2 and using those operator costs in our cost model. In SMCQL, private intermediate results are held in secure arrays, i.e., oblivious RAM (ORAM), and operators are compiled as circuits that receive an ORAM as input, evaluate the operator, and output the result ORAM to the next operator. With Shrinkwrap, we resize the intermediate ORAM arrays by obliviously eliminating excess dummy tuples, reducing the intermediate cardinalities of our results and improving performance. Any ORAM implementation is compatible with Shrinkwrap, as long as its I/O characteristics can be defined as in Table 2.

6.2 Circuit Model

Alternatively, if a private data federation uses a circuit-based protocol to express database operators, we can estimate the cost of an operator o with n_gates gates in its circuit, a maximum circuit depth of d_circuit, n_in elements of input, and an output size of n_out as

cost_o(N) = c_in · n_in + c_g · n_gates + c_d · d_circuit + c_out · n_out,

where c_in and c_out account for the encoding and decoding costs needed for a given protocol, c_g and c_d account for the costs due to the gate count and circuit depth respectively, the number of input elements is n_in = Σ_{nj ∈ N} nj, and the number of output elements is n_out = Π_{nj ∈ N} nj.

The generic nature of this model allows us to capture the performance features of all circuit-based secure computation protocols. For example, in this work we utilize the EMP toolkit [51], a state-of-the-art circuit protocol, in conjunction with Shrinkwrap. As EMP has not been used in private data federations, we implemented EMP with SMCQL. Instead of compiling operators into individual circuits, EMP compiles a SQL query into a single circuit, though it still pads intermediate results within the circuit. In this setting, we can still apply Shrinkwrap as we did in the RAM model, where we minimize the padding according to our differential privacy guarantees. Instead of modifying the ORAM, we directly modify the circuit to eliminate dummy tuples.

For both our RAM model and our circuit model implementations, Shrinkwrap uses a cost model to estimate the execution cost of intermediate result padding. With the model, we can carry out our privacy budget optimization and allocation as described in Section 5. Shrinkwrap's protocol-agnostic design allows it to optimize the performance of secure computation, regardless of the implementation.
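The per-operator costs of Table 2, the copy cost cost_c, and the circuit-model estimate of Section 6.2 translate directly into code. The sketch below uses a hypothetical ORAM unit cost that grows logarithmically with array size, purely to illustrate how the cost model composes; the unit cost and default coefficients are our assumptions.

import math

def unit_cost(n: float) -> float:
    # Hypothetical ORAM unit cost per read/write on an array of size n.
    return math.log2(n) if n > 1 else 1.0

def cost_filter(n1):
    # Table 2: read n1 input tuples, write n1 (padded) output tuples.
    return n1 * unit_cost(n1) + n1 * unit_cost(n1)

def cost_join(n1, n2):
    # Table 2: nested-loop style oblivious join over two secure arrays.
    return n1 * unit_cost(n1) + n1 * n2 * unit_cost(n2) + n1 * n2 * unit_cost(n1 * n2)

def cost_copy(n, n_prime):
    # cost_c(n, n'): bulk-copy from the padded array to the resized array.
    return n * unit_cost(n) + n_prime * unit_cost(n_prime)

def cost_circuit(n_in, n_gates, d_circuit, n_out, c_in=1.0, c_g=1.0, c_d=1.0, c_out=1.0):
    # Circuit-model estimate from Section 6.2.
    return c_in * n_in + c_g * n_gates + c_d * d_circuit + c_out * n_out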

6.3 M-Party Support

In this work, we implement Shrinkwrap using two-party secure computation protocols, meaning that we run our experiments over two data owners. However, Shrinkwrap supports additional data owners. By leveraging the associativity of our operators, we can convert all m-party computations into a series of binary computations. For example, assume we want to join two tables R and S that are horizontally partitioned across three data owners as R1, R2, R3 and S1, S2, S3. We can join across the three data owners as: (R1 ⋈ S1) ∪ (R1 ⋈ S2) ∪ (R1 ⋈ S3) ∪ (R2 ⋈ S1) ∪ (R2 ⋈ S2) ∪ (R2 ⋈ S3) ∪ (R3 ⋈ S1) ∪ (R3 ⋈ S2) ∪ (R3 ⋈ S3). We can execute this series of binary joins using two-party secure computation and then union the results. With this construction, we can scale Shrinkwrap out to any number of parties. However, this algorithm is not efficient, since it naively carries out m² secure computation operations for every m parties we compute over. We could use m-party secure computation protocols [3, 5, 20], which have fewer high-level operators and have made large performance strides, but these protocols are still expensive. We leave performance improvements in this setting for future work.

7. RESULTS

We now look at Shrinkwrap performance using both real-world and synthetic datasets and workloads. First, we cover our experimental design and setup. Next, we evaluate the end-to-end performance of our system and show the performance vs. privacy trade-off under both true and differentially private answer settings. Then, we look at how Shrinkwrap performs at the operator level and the accuracy of our cost model, followed by a discussion of our budget splitting strategies. Finally, we examine how Shrinkwrap scales to larger datasets and more complex queries.

7.1 Experimental Setup

For this work, we implemented Shrinkwrap on top of an existing healthcare database federation that uses SMCQL [4]. This private data federation serves a group of hospitals that wish to analyze the union of their electronic health record systems for research while keeping individual tuples private.

A clinical data research network (CDRN) is a consortium of healthcare sites that agree to share their data for research. CDRN data providers wish to keep their data private. We examine this work in the context of HealthLNK [44], a CDRN for Chicago-area healthcare sites. This repository contains records from seven Chicago-area healthcare institutions, each with their own member hospitals, from 2006 to 2012, totaling about 6 million records. The data set is selected from a diverse set of hospitals, including academic medical centers, large county hospitals, and local community health centers. Medical researchers working on HealthLNK develop SQL queries and run them on data from a set of healthcare sites. These researchers are the clients of the private data federation, and the healthcare sites are the private data owners.

Dataset: We evaluate Shrinkwrap on medical data from two Chicago-area hospitals in the HealthLNK data repository [44] over one year of data. This dataset has 500,000 patient records, or 15 GB of data. For additional evaluation, we generate synthetic data up to 750 GB based on this source medical data. To simplify our experiments, we use a public patient registry for common diseases that maintains a list of anonymized patient identifiers associated with these conditions. We filter our query inputs using this registry.

Query Workload: For our experiments, we chose three representative queries based on clinical data research protocols [23, 43] and evaluate Shrinkwrap on this workload using de-identified medical records from the HealthLNK data repository. We also generate synthetic versions of Aspirin Count that contain additional join operators. We refer to these queries by the number of join operators, e.g., 3-Join for 3 join operators. The queries are shown in Table 3.

Table 3: HealthLNK query workload

Dosage Study: SELECT DISTINCT d.pid FROM diagnoses d, medications m WHERE d.pid = m.pid AND medication = 'aspirin' AND icd9 = 'circulatory disorder' AND dosage = '325mg'
Comorbidity: SELECT diag, COUNT(*) cnt FROM diagnoses WHERE pid ∈ cdiff cohort ∧ diag <> 'cdiff' ORDER BY cnt DESC LIMIT 10;
Aspirin Count: SELECT COUNT(DISTINCT pid) FROM diagnosis d JOIN medication m ON d.pid = m.pid JOIN demographics demo ON d.pid = demo.pid WHERE d.diag = 'heart disease' AND m.med = 'aspirin' AND d.time <= m.time;
3-Join: SELECT COUNT(DISTINCT pid) FROM diagnosis d JOIN medication m ON d.pid = m.pid JOIN demographics demo ON d.pid = demo.pid JOIN demographics demo2 ON d.pid = demo2.pid WHERE d.diag = 'heart disease' AND m.med = 'aspirin' AND d.time <= m.time;

Configuration: Shrinkwrap query processing runs on top of PostgreSQL 9.5 running on Ubuntu Linux. We evaluated our two-party prototype on 6 servers running in pairs. The servers each have 64 GB of memory and 7200 RPM NL-SAS hard drives, and they are on a dedicated 10 Gb/s network. Our results report the average runtime of three runs per experiment. We implement Shrinkwrap under both the RAM model (using ObliVM [33]) and the circuit model (using EMP [51]). For most experiments, we show results using the RAM model, though corresponding circuit model results are similar. Unless otherwise specified, the results show the end-to-end runtime of a query with output policy 1, i.e., true results.

7.2 End-to-end Performance

In our first experiment, we look at the end-to-end performance of Shrinkwrap compared to baseline private data federation execution. For execution, we set ε to 0.5, δ to 0.00005, and use the optimal budget splitting approach.

In Figure 5, we look at the overall performance of four queries under Shrinkwrap for both RAM model and circuit model secure computation. For Comorbidity, the execution does not contain any join operators, meaning that there is no explosion in intermediate result cardinality, so Shrinkwrap provides fewer benefits. Dosage Study contains a join operator with a parent distinct operator. Applying differential privacy to the output of the join improves performance by close to 5x under the RAM model, but it sees fewer benefits under the circuit model. Aspirin Count contains two joins and sees an order of magnitude improvement under the RAM model and a 2x improvement under the circuit model. Finally, 3-Join has an enormous cardinality blow-up, allowing Shrinkwrap to improve its performance by 33x under the RAM model and 35x under the circuit model. In Section 7.6, we examine this effect of join operators on performance. Although the implementation in the circuit model outperforms the RAM model by an order of magnitude, Shrinkwrap gives similar efficiency improvements in both systems.

7.3 Privacy, Performance, and Accuracy

We consider the performance vs. privacy trade-off provided by Shrinkwrap and evaluate the effect of the execution privacy budget on query performance.

[Figure 5: End-to-end Shrinkwrap performance for all queries at ε = 0.5, δ = .00005. Execution time (s) with and without Shrinkwrap for Comorbidity, Dosage Study, Aspirin Count, and 3-Join; (a) RAM model (ObliVM), (b) circuit model (EMP).]

For this experiment, we use the 3-Join query with the optimal budget splitting approach, fix the total privacy budget to ε = 1.5 and δ = 0.00005, and use output setting 2, i.e., differentially private final answers. We vary the privacy budget for performance ε1→l from 0.1 to 1.5 and fix δ1→l = δ. The remaining privacy budget ε − ε1→l is spent on the query output using a Laplace mechanism.

In Figure 6a, we see the execution time as a function of the privacy budget for performance (ε1→l). As the privacy budget for performance ε1→l increases, the execution time decreases. A lower budget means that more noise must be present in the intermediate result cardinalities, which translates to larger cardinalities and higher execution time variance. We know that I/O cost dominates the Shrinkwrap cost model, so larger cardinalities mean more I/O accesses, which causes lower performance.

In Figure 6b, we show the accuracy versus privacy trade-off by varying the privacy budget for performance ε1→l. We report the execution time and the error introduced to the final query answer at various privacy budgets for performance ε1→l < 1.5. Out of a total possible output size of 5500 tuples, the query output is 10/5500 = 0.18% of the total possible output size. If we allow additive noise of 0.36% (20/5500), we get a 100x performance improvement. Since the total privacy ε is fixed, the more we spend for performance (larger ε1→l), the less privacy budget we have at the output, i.e., as our execution time improves, our output noise increases.

[Figure 6: Privacy, performance, and accuracy trade-offs, computed using different levels of privacy budget for performance ε1→l ∈ {0.1, . . . , 1.5}. Executed using the RAM model, 3-Join, ε = 1.5, δ = 0.00005; (a) average execution time vs. ε1→l, (b) average execution time vs. final result error (# of tuples).]

7.4 Evaluating Budget Splitting Strategies

We first examine the relative performance of the three budget splitting strategies introduced in Section 5: uniform, eager, and optimal. Recall that uniform splits the budget evenly across all the operators, eager inserts the entire budget at the first operator, and optimal uses the Shrinkwrap cost model, along with cardinality estimates, to identify an optimal budget split. We also include an oracle approach that shows the performance of Shrinkwrap if true cardinalities are used in the cost model to split the budget instead of cardinality estimates. Note that oracle does not satisfy differential privacy, but it gives an upper bound on the best performance achievable through privacy budgeting. For this experiment, we use the Aspirin Count and 3-Join queries, since they contain multiple operators where Shrinkwrap generates differentially private cardinalities, and we set the privacy parameters usable during execution to ε = 0.5 and δ = .00005.

Figure 7 displays the relative speedup of all three approaches and the oracle for both queries over the baseline, fully padded, private data federation execution. All three Shrinkwrap approaches provide significant performance improvements, ranging from 3x to 33x. As expected, optimal performs best for both queries. We also see that eager performs better for Aspirin Count, while uniform performs better for 3-Join.

[Figure 7: Shrinkwrap execution speedup over baseline using different budget strategies (uniform, eager, optimal, and oracle) for Aspirin Count and 3-Join. Executed using the RAM model, ε = 0.5, δ = .00005; annotated speedups over full padding range from 3x–8x on Aspirin Count to 25x–34x on 3-Join.]

The benefits of minimized intermediate result cardinalities cascade as those results flow through the operator tree. By using the entire budget on the first join operator in Aspirin Count, the eager approach maximizes the effect of the intermediate result cardinality reduction and, as a result, outperforms the uniform approach.

Figure 7: Shrinkwrap execution speedup over baseline using different budget strategies. Executed using RAM model, ε = 0.5, δ = .00005. (Bar chart: speedup over full padding, roughly 3x to 34x, for Uniform, Eager, Optimal, and Oracle on Aspirin Count and 3-Join.)

Figure 8: Per operator performance using different budget strategies. Executed using Aspirin Count, RAM model, ε = 0.5, δ = .00005. (Bar chart: average execution time in seconds per operator (Join M., Join D., Distinct, Total) for Baseline, Uniform, Eager, Optimal, and Oracle.)

The benefits of minimized intermediate result cardinalities cascade as those results flow through the operator tree. By using the entire budget on the first join operator in Aspirin Count, the eager approach maximizes the effect of the intermediate result cardinality reduction and, as a result, outperforms the uniform approach. However, for the 3-Join query, the presence of the additional join operator overrides the cascading cardinality benefit of the first join operator. The uniform approach outperforms the eager approach by ensuring that all three of the join operators receive differentially private cardinalities. The optimal approach outperforms both eager and uniform by combining the best of both worlds. Optimal applies differential privacy to all operators, like uniform, but it uses a larger fraction of the budget on earlier operators, like eager.

Looking at the oracle approach, we can evaluate the accuracy of our cost model. For privacy reasons, our cost model does not use the true cardinalities. Here, we see that the optimal approach, which uses estimated cardinalities, and the oracle approach, which uses true cardinalities, provide similar performance. In experiments using Aspirin Count, we calculate the correlation coefficient between the true execution time and the estimated execution time based on our cost model as .998 for the circuit model and .931 for the RAM model, given ε = 0.5 and δ = .00005. We see that the Shrinkwrap cost model provides a reasonably accurate prediction of the true cost.

7.5 Operator Breakdown

Now, we look at the execution time by operator to see where Shrinkwrap provides the largest impact. We include only private operators in the figure. For this experiment, we use the Aspirin Count query with ε = 0.5 and δ = 0.00005. We show the execution times for the baseline, fully padded approach for intermediate results, as well as the uniform, eager, and optimal budget splitting approaches.

Figure 8 shows the execution times of each operator in the Aspirin Count query. Note that Shrinkwrap does not apply any differential privacy until the output of Join M. (Join on Medications), so we see no speed-up in the actual Join M. operator, only in the later query operators.

The first operator to see a benefit from Shrinkwrap is Join D. (Join on Demographics), the second join in the query. Shrinkwrap generates a differentially private cardinality for the output of the previous join, Join M., according to the selected budget splitting approach. For uniform, the allocated budget is not large enough to generate a differentially private cardinality that is smaller than the fully padded cardinality. In fact, the overhead of calculating the differentially private cardinality actually causes the operator runtime to rise above the baseline runtime. For eager, Shrinkwrap spends the entire budget on the output of the first join and reduces the intermediate cardinality by an order of magnitude. The optimal approach uses enough budget to reduce the intermediate cardinality, but not as much as eager.

The next operator, DISTINCT, also sees a large performance benefit. All three budget splitting approaches reduce the runtime for DISTINCT by about an order of magnitude. Since the baseline intermediate cardinality for DISTINCT is much larger, the uniform approach generates a differentially private cardinality that reduces the intermediate result size and improves performance. The eager approach does not have any additional budget to use, but the effect of the previous intermediate cardinality cascades through the query tree and reduces the execution time for DISTINCT as well. Finally, the optimal approach applies its remaining budget to see a significant performance improvement.

In the Aspirin Count query, the three budget splitting approaches allocate the budget between three operators and provide insight into the trade-off between early and late budget allocation. We see widely varying execution times for each of the operators, with different operators contributing the bulk of the performance cost depending on the budget strategy. Here, the optimal strategy gives the best performance. The substantial variance in execution time across these operators demonstrates the value of the added accuracy that our I/O cost model provides.
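Whether an operator sees any benefit comes down to how far its differentially private cardinality sits below the fully padded worst case. The following sketch is a rough illustration under assumed numbers and a simple shifted-Laplace padding rule, not Shrinkwrap's exact truncated mechanism: it shows how the padded size approaches the true cardinality as an operator's share of ε grows, and degenerates to the fully padded size when the share is tiny.

import math, random

# Hypothetical numbers and a simplified padding rule, for illustration only.

def dp_cardinality(true_card, padded_card, eps, delta, rng):
    # Shift the noise so the result is (almost) never below the true
    # cardinality, keeping query answers exact, then clamp to the
    # fully padded size, which is always a safe fallback.
    shift = math.ceil((1.0 / eps) * math.log(1.0 / (2.0 * delta)))
    noise = round(rng.expovariate(eps) - rng.expovariate(eps))  # Laplace with scale 1/eps
    return min(max(true_card + shift + noise, true_card), padded_card)

rng = random.Random(7)
true_card, padded_card = 400, 2_000            # assumed intermediate size vs. full padding
for eps_op in (0.005, 0.17, 0.5):              # tiny share, uniform-like share, whole budget
    sizes = [dp_cardinality(true_card, padded_card, eps_op, 5e-5, rng) for _ in range(4)]
    print(eps_op, sizes)

With ε = 0.005, the shifted noise pushes the estimate past the fully padded size, so the operator pads as if Shrinkwrap were absent; with larger shares the padded size hugs the true cardinality, which is the cascading reduction described above.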
7.6 Join Scale Up

We now look at how Shrinkwrap scales with more complex workloads. From our experiments, we know that the number of joins in a query has an extremely large impact on execution time. More complex workloads typically contain more nested statements and require more advanced SQL processing, but they can be broken down into a series of simpler statements. This work focuses on complexity as a function of the join count in order to target the largest performance bottleneck in a single SQL statement.

In our experiment, we scale the join count and measure the resulting performance. Since Shrinkwrap applies differential privacy at the operator outputs, we measure the performance impact of join operators by looking at the execution time of their immediate parent operator. For consistency, we truncate the number of non-inherited input tuples of each join to equal magnitudes and use ε = 0.5 and δ = 0.00005 for all runs.

Figure 9: Aspirin Count with synthetic join scaling. Executed using RAM model, ε = 0.5, δ = .00005. (Execution time in seconds for Baseline and Shrinkwrap at 1, 2, and 3 joins.)

From Figure 9, we see how execution time scales as a function of the join count. As the number of joins increases, the execution time also increases. To find the source of the performance slowdown, we can look at the Shrinkwrap cost model. For each join operator, the baseline private data federation fully pads the output. If the join inputs were both of size n, then the output would be n². As the number of joins increases, the cardinality rises from n to n² to n³. The exponential increase in output cardinality requires significantly more I/O accesses, which causes the large slowdown in performance. Shrinkwrap applies differential privacy to reduce the output size at each join, reducing the magnitude of each join output. While Shrinkwrap still sees significant growth in execution time as a function of the join count, the overall performance is orders of magnitude better.
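The growth argument above can be checked with back-of-the-envelope arithmetic. The sketch below uses an assumed input size, join selectivity, and a fixed noise overhead standing in for the differentially private cardinality; it is meant only to show the n versus n^k trend that drives the I/O cost, not to reproduce measured numbers.

# Assumed sizes and selectivity, for illustration of the trend only.

def padded_chain(n, joins):
    # Baseline: each join output is fully padded to the product of its inputs.
    out = n
    for _ in range(joins):
        out *= n                      # n -> n^2 -> n^3 -> ...
    return out

def shrinkwrapped_chain(n, joins, selectivity, dp_overhead):
    # Shrinkwrap-style: each intermediate is padded only to a noisy estimate
    # of its true size (true size plus an assumed differential privacy overhead).
    out = n
    for _ in range(joins):
        true_out = int(out * n * selectivity)
        out = true_out + dp_overhead
    return out

n = 1_000
for k in (1, 2, 3):
    print(k, "joins:", padded_chain(n, k), "padded tuples vs.",
          shrinkwrapped_chain(n, k, selectivity=0.001, dp_overhead=200), "with Shrinkwrap")

Since the cost model charges I/O roughly in proportion to these cardinalities, the fully padded plan pays for n^k tuples at the k-th join while Shrinkwrap pays close to the true size plus noise, which is why the gap in Figure 9 widens with the join count.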
7.7 Data Size Scale Up

We evaluate the performance of Shrinkwrap with increasing data sizes. We leverage the higher efficiency of our circuit model protocol, EMP [51], to examine the effect of Shrinkwrap on larger input data sizes. Without Shrinkwrap, the RAM model protocol, ObliVM [33], needs 1.3 hours to run the Aspirin Count query and, at 50x data scale, 65 hours to complete. Instead, we use EMP, which completes in 15 minutes.

Figure 10 shows the execution time with and without Shrinkwrap using an implementation of EMP [51] on our Aspirin Count reference query. We generated synthetic data based on the original tables, giving us an effective maximum data size of 750 GB. In Figure 10, we see that Shrinkwrap provides a significant improvement during execution. As the data size grows, our performance improvement grows as well, reflecting the power and flexibility of Shrinkwrap.

Figure 10: Aspirin Count with synthetic data scaling. Executed using Circuit model, ε = 0.5, δ = .00005. (Execution time in seconds, with and without Shrinkwrap, at synthetic data sizes from 1x to 50x above baseline.)

8. RELATED WORK

Within the literature, different approaches exist to improve query processing in private data federations, such as databases based on homomorphic encryption, TEEs, differential privacy, and cloud computation [1, 29, 30, 36, 45, 54]. Shrinkwrap, on the other hand, provides general-purpose, hardware-agnostic, in-situ SQL evaluation with provable privacy guarantees and exact results.

PINQ [36] introduced the first database with differential privacy, along with privacy budget tracking and sensitivity analysis for operator composition. We extend this work by applying its privacy calculus to private data federations. Follow-on work in differential privacy appears in DJoin [38], where the system supports private execution of certain SQL operators over a federated database with strong privacy guarantees and noisy results. Shrinkwrap supports a larger set of database operators and, instead of using noisy results to safeguard data, uses noisy cardinalities to improve performance.

He et al. [22] applied computational differential privacy to join operators for private record linkage and proposed a three-desiderata approach to operator execution: precision, provability, and performance. Shrinkwrap incorporates this style of join execution and its approach to execution trade-offs.

Pappas et al. [42] showed that, by trading small bits of privacy for performance within provable bounds using Bloom filters, systems can provide scalable DBMS queries over arbitrary Boolean queries. Shrinkwrap applies this pattern of provable privacy versus performance trade-offs to the larger set of arbitrary, non-Boolean SQL queries.

Both Opaque [56] and Hermetic [54] use cost models to estimate performance slowdowns due to privacy-preserving computation. In both cases, they use secure enclaves to carry out private computation, which provides constant-cost I/O. As such, their cost models cannot account for the variable-cost I/O present in Shrinkwrap.

9. CONCLUSIONS AND FUTURE WORK

In this work, we introduce Shrinkwrap, a protocol-agnostic system for differentially private query processing that significantly improves the performance of private data federations. We use a computational differential privacy mechanism to selectively leak statistical information within provable privacy bounds and reduce the size of intermediate results. We introduce a novel cost model and privacy budget optimizer to maximize the privacy, performance, and accuracy trade-off. We integrate Shrinkwrap into an existing private data federation architecture and collect results using real-world medical data.

In future research, we hope to further improve the performance of private data federations and the efficiency of Shrinkwrap. We are currently investigating novel secure array algorithms and data structures to improve I/O access time, privacy budget optimizations over multiple queries, and extensions of private data federations using additional secure computation protocols and cryptographic primitives.

Acknowledgments: We thank Abel Kho, Satyender Goel, Katie Jackson, and Jess Joseph Behrens for their guidance and assistance with CAPriCORN and HealthLNK data. We also thank the HealthLNK team for sharing de-identified electronic health record data for this study.

10. REFERENCES
[1] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, and R. Venkatesan. Orthogonal Security with Cipherbase. CIDR, 2013.
[2] A. Arasu and R. Kaushik. Oblivious Query Processing. ICDT, pages 26–37, 2014.
[3] G. Asharov and Y. Lindell. The BGW protocol for perfectly-secure multiparty computation. Cryptology and Information Security Series, 10(189):120–167, 2013.
[4] J. Bater, G. Elliott, C. Eggen, S. Goel, A. Kho, and J. Rogers. SMCQL: Secure Querying for Federated Databases. PVLDB, 10(6):673–684, 2017.
[5] D. Beaver, S. Micali, and P. Rogaway. The Round Complexity of Secure Protocols. ACM Symposium on Theory of Computing (STOC), pages 503–513, 1990.
[6] M. Bellare, V. T. Hoang, S. Keelveedhi, and P. Rogaway. Efficient garbling from a fixed-key blockcipher. In S&P, pages 478–492. IEEE, 2013.
[7] V. Bindschaedler, P. Grubbs, D. Cash, T. Ristenpart, and V. Shmatikov. The tao of inference in privacy-protected databases. PVLDB, 11(11):1715–1728, 2018.
[8] D. Bogdanov, L. Kamm, S. Laur, P. Pruulmann-Vengerfeldt, R. Talviste, and J. Willemson. Privacy-preserving statistical data analysis on federated databases. In Annual Privacy Forum, pages 30–55. Springer, 2014.
[9] G. Chen, S. Chen, Y. Xiao, Y. Zhang, Z. Lin, and T. H. Lai. SgxPectre Attacks: Leaking Enclave Secrets via Speculative Execution. 2018.
[10] S. S. Chow, J.-H. Lee, and L. Subramanian. Two-party computation model for privacy-preserving queries over distributed databases. In NDSS, 2009.
[11] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu. Tools for Privacy Preserving Distributed Data Mining. ACM SIGKDD Explorations, 4(2), 2002.
[12] J. Doerner and A. Shelat. Scaling ORAM for Secure Computation. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS '17, pages 523–535, 2017.
[13] C. Dwork. Differential privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, pages 1–12, 2006.
[14] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 2014.
[15] K. E. Emam. Heuristics for de-identifying health data. IEEE Security & Privacy, 6(4):58–61, 2008.
[16] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.
[17] A. Ferraiuolo, R. Xu, D. Zhang, A. C. Myers, and G. E. Suh. Verification of a Practical Hardware Security Architecture Through Static Information Flow Analysis. ACM SIGARCH Computer Architecture News, 45(1):555–568, 2017.
[18] O. Goldreich. Towards a theory of software protection and simulation by oblivious rams. In STOC, pages 182–194, New York, NY, USA, 1987. ACM.
[19] O. Goldreich. Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press, New York, NY, USA, 2004.
[20] O. Goldreich, S. Micali, and A. Wigderson. How to Play Any Mental Game. STOC '87, pages 218–229, 1987.
[21] M. T. Goodrich and M. Mitzenmacher. Privacy-preserving access of outsourced data via oblivious RAM simulation. In ICALP, 2011.
[22] X. He, A. Machanavajjhala, C. Flynn, and D. Srivastava. Composing Differential Privacy and Secure Computation. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS '17, number 1, pages 1389–1406, New York, New York, USA, 2017.
[23] A. F. Hernandez, R. L. Fleurence, and R. L. Rothman. The ADAPTABLE Trial and PCORnet: shining light on a new research paradigm. Annals of Internal Medicine, 163(8):635–636, 2015.
[24] Y. Huang, D. Evans, J. Katz, and L. Malka. Faster secure two-party computation using garbled circuits. In USENIX Security 2011, 2011.
[25] Y. Ishai, J. Kilian, K. Nissim, and E. Petrank. Extending oblivious transfers efficiently. In CRYPTO, LNCS, pages 145–161, 2003.
[26] M. S. Islam, M. Kuzu, and M. Kantarcioglu. Access pattern disclosure on searchable encryption: Ramification, attack and mitigation. In Network and Distributed System Security Symposium (NDSS), 2012.
[27] N. Johnson, J. P. Near, and D. Song. Towards Practical Differential Privacy for SQL Queries. PVLDB, 11(5):526–539, 2018.
[28] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom. Spectre Attacks: Exploiting Speculative Execution. 2018.
[29] S. Laur, R. Talviste, and J. Willemson. From oblivious AES to efficient and secure database join in the multiparty setting. Lecture Notes in Computer Science, 7954 LNCS:84–101, 2013.
[30] S. Laur, J. Willemson, and B. Zhang. Round-efficient Oblivious Database Manipulation. ISC'11, pages 262–277, 2011.
[31] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, and S. Mangard. ARMageddon: Cache Attacks on Mobile Devices. USENIX Security, pages 549–564, 2016.
[32] C. Liu, Y. Huang, E. Shi, J. Katz, and M. Hicks. Automating Efficient RAM-Model Secure Computation. 2014 IEEE Symposium on Security and Privacy, pages 623–638, 2014.
[33] C. Liu, X. S. Wang, K. Nayak, Y. Huang, and E. Shi. ObliVM: A Programming Framework for Secure Computation. Oakland, pages 359–376, 2015.
[34] A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008.
[35] R. McKenna, G. Miklau, M. Hay, and A. Machanavajjhala. Optimizing error of high-dimensional statistical queries under differential privacy. PVLDB, 11(10):1206–1219, 2018.
[36] F. D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In 2009 ACM SIGMOD, pages 19–30. ACM, 2009.
[37] I. Mironov, O. Pandey, O. Reingold, and S. Vadhan. Computational differential privacy. In CRYPTO, 2009.
[38] A. Narayan and A. Haeberlen. DJoin: Differentially private join queries over distributed databases. Proceedings of the 10th USENIX Symposium, page 14, 2012.
[39] M. Naveed, S. Kamara, and C. V. Wright. Inference Attacks on Property-Preserving Encrypted Databases. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 644–655. ACM, 2015.
[40] K. Nayak, X. S. Wang, S. Ioannidis, U. Weinsberg, N. Taft, and E. Shi. GraphSC: Parallel Secure Computation Made Easy.
[41] V. Nikolaenko, S. Ioannidis, U. Weinsberg, M. Joye, N. Taft, and D. Boneh. Privacy-preserving matrix factorization. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 801–812. ACM, 2013.
[42] V. Pappas, F. Krell, B. Vo, V. Kolesnikov, T. Malkin, S. G. Choi, W. George, A. Keromytis, and S. Bellovin. Blind seer: A scalable private DBMS. Proceedings - IEEE Symposium on Security and Privacy, pages 359–374, 2014.
[43] PCORI. Characterizing the Effects of Recurrent Clostridium Difficile Infection on Patients. IRB Protocol, ORA: 14122, 2015.
[44] PCORI. Exchanging de-identified data between hospitals for city-wide health analysis in the Chicago Area HealthLNK data repository (HDR). IRB Protocol, 2015.
[45] R. Popa and C. Redfield. CryptDB: protecting confidentiality with encrypted query processing. SOSP, pages 85–100, 2011.
[46] J. Saia and M. Zamani. Recent Results in Scalable Multi-Party Computation. 2014.
[47] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 23–34. ACM, 1979.
[48] S. Vadhan. The Complexity of Differential Privacy. Springer International Publishing, 2017.
[49] J. Van Bulck, M. Minkin, O. Weisse, D. Genkin, B. Kasikci, F. Piessens, M. Silberstein, T. F. Wenisch, Y. Yarom, R. Strackx, and K. Leuven. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. Proceedings of the 27th USENIX Security Symposium, pages 991–1008, 2018.
[50] N. Volgushev, M. Schwarzkopf, B. Getchell, A. Lapets, M. Varia, and A. Bestavros. Conclave Workflow Manager for MPC, 2018.

[51] X. Wang, A. J. Malozemoff, and J. Katz. EMP-Toolkit: Efficient Multiparty Computation Toolkit. https://github.com/emp-toolkit, 2016.
[52] X. Wang, A. J. Malozemoff, and J. Katz. Faster secure two-party computation in the single-execution setting. Lecture Notes in Computer Science, 10212 LNCS:399–424, 2017.
[53] X. S. Wang, K. Nayak, C. Liu, T.-H. H. Chan, E. Shi, E. Stefanov, and Y. Huang. Oblivious Data Structures. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security - CCS '14, pages 215–226, 2014.
[54] M. Xu, A. Papadimitriou, A. Feldman, and A. Haeberlen. Using Differential Privacy to Efficiently Mitigate Side Channels in Distributed Analytics. In Proceedings of the 11th European Workshop on Systems Security - EuroSec'18, pages 1–6, New York, New York, USA, 2018. ACM Press.
[55] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual Symposium on Foundations of Computer Science, SFCS '82, pages 160–164, Washington, DC, USA, 1982. IEEE Computer Society.
[56] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E. Gonzalez, and I. Stoica. Opaque: An oblivious and encrypted distributed analytics platform. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI'17, pages 283–298, Berkeley, CA, USA, 2017. USENIX Association.
