Arxiv:1809.07750V3 [Cs.CR] 4 May 2021 Ible with Mechanisms Like This One—It Is Impossible to Run Analysis of the Data
Total Page:16
File Type:pdf, Size:1020Kb
CHORUS: a Programming Framework for Building Scalable Differential Privacy Mechanisms Noah Johnson Joseph P. Near Joseph M. Hellerstein Dawn Song UC Berkeley University of Vermont UC Berkeley UC Berkeley [email protected] [email protected] [email protected] [email protected] Abstract—Differential privacy is fast becoming the gold stan- [64], [67] to special-purpose analytics tasks such as graph dard in enabling statistical analysis of data while protecting analysis [24], [42], [46], [47], [72], linear queries [8], [25], the privacy of individuals. However, practical use of differ- [41], [50]–[53], [68], [81], [82], [85], and analysis of data ential privacy still lags behind research progress because streams [31], [74]. research prototypes cannot satisfy the scalability require- Despite extensive academic research and an abundant ments of production deployments. To address this challenge, supply of mechanisms, differential privacy has not been we present CHORUS, a framework for building scalable dif- widely adopted in practice. Existing applications of dif- ferential privacy mechanisms which is based on cooperation ferential privacy in practice are limited to specialized use between the mechanism itself and a high-performance pro- cases [2], [34]. duction database management system (DBMS). We demon- A major challenge for the practical adoption of differ- strate the use of CHORUS to build the first highly scal- ential privacy is the ability to deploy differential privacy able implementations of complex mechanisms like Weighted mechanisms at scale. Today’s data analytics infrastruc- PINQ, MWEM, and the matrix mechanism. We report on tures include industrial-grade database management sys- our experience deploying CHORUS at Uber, and evaluate its tems (DBMSs) carefully tuned for performance and reli- scalability on real-world queries. ability, designed to process datasets consisting of billions of rows. Index Terms—privacy, differential privacy, SQL queries, The simplest mechanisms for differential privacy, like query rewriting, security the Laplace mechanism [30], answer an analyst’s query by adding noise to the final result of the query. This 1. Introduction mechanism can be easily deployed atop an existing high- performance DBMS by leveraging the DBMS to execute Organizations are collecting more and more sensitive the analyst’s query, then adding the right amount of noise information about individuals. As this data is highly valu- to the result. Since it uses the DBMS to perform the actual able for a broad range of business interests, organizations data processing tasks, this approach scales well. Some ex- are motivated to provide analysts with flexible access to isting work, such as FLEX [44], takes this approach, which the data. At the same time, the public is increasingly we call the post-processing architecture. For appropriate concerned about privacy protection. There is a growing mechanisms, the post-processing architecture solves the and urgent need for technology solutions that balance scalability problem. these interests by supporting general-purpose analytics However, more advanced differential privacy mecha- while guaranteeing privacy protection. nisms require fundamental changes to the way queries exe- Differential privacy [28], [32] is widely recognized by cute. Summation queries, for example, require clipping the experts as the most rigorous theoretical solution to this data before summing it to control the influence of outliers. problem. Differential privacy provides a formal guarantee The post-processing approach is fundamentally incompat- of privacy for individuals while allowing general statistical arXiv:1809.07750v3 [cs.CR] 4 May 2021 ible with mechanisms like this one—it is impossible to run analysis of the data. In short, it states that the presence or the analyst’s query unmodified and then achieve differen- absence of any single individual’s data should not have a tial privacy by post-processing the results. Unfortunately, large effect on the results of a query. This allows precise the vast majority of recently-developed mechanisms fall answers to questions about populations in the data while into this category (e.g. [8], [24], [25], [41], [42], [46], guaranteeing the results reveal little about any individual. [47], [50]–[53], [68], [72], [81], [82], [85]), and cannot Unlike alternative approaches such as anonymization and be implemented using the post-processing architecture. k-anonymity, differential privacy protects against a wide range of attacks, including attacks using auxiliary infor- As a result, no scalable implementation exists for mation [27], [61], [65], [77]. many of the exciting differential privacy mechanisms de- Current research on differential privacy focuses on veloped in recent years. The implementations which have development of new algorithms, called mechanisms, to been developed (e.g. [49], [55], [59], [67], [80]) modify achieve differential privacy for a particular class of or replace the DBMS with a custom engine, which is un- queries. Researchers have developed dozens of mecha- likely to offer performance on par with modern production nisms covering a broad range of use cases, from general- DBMSs. purpose statistical queries [19], [30], [54], [55], [59], The CHORUS Framework. This paper describes CHO- 1 RUS, a framework for developing and deploying cutting- tributions: edge differential privacy mechanisms at scale. CHO- 1) We present the CHORUS framework, which enables a RUS makes it easy to develop mechanism implementa- novel cooperative architecture for implementing dif- tions which work in cooperation with an existing high- ferential privacy mechanisms atop high-performance performance DBMS, even for mechanisms which require DBMSs (§ 3). modifying queries or generating entirely new ones. CHO- 2) We demonstrate CHORUS’s flexibility by developing RUS supports scalable implementations by leveraging the scalable implementations for a number of advanced DBMS for data processing tasks, instead of custom code. differential privacy mechanisms (§ 5). We call this the cooperative architecture for differential privacy mechanisms. 3) We release CHORUS as open source [3] and describe CHORUS provides a programming framework to sup- how to deploy it to provide differential privacy in port implementing mechanisms in the cooperative ar- production settings (§ 6). chitecture. The framework has three major components: 4) We report on our experience deploying CHORUS to rewriting, for modifying queries to perform functions enforce differential privacy at Uber, where it processes like clipping; analysis, for analyzing queries to determine more than 10,000 queries per day (§ 6). properties like how much noise is required for differential 5) We demonstrate the scalability of CHORUS by evaluat- privacy; and post-processing, for processing the results of ing it on 18,774 real-world queries with a database of executing queries. To implement a summation mechanism 300 million rows (§ 7). with clipping, for example, we can use CHORUS’s rewrit- ing component to modify the analyst’s query so that the 2. Background DBMS performs the clipping as well as the summation. With this modification, the rest of the mechanism can Differential privacy provides a formal guarantee of be implemented via analysis and post-processing of the indistinguishability. This guarantee is defined in terms of rewritten query. a privacy budget —the smaller the budget, the stronger CHORUS supports integration with any standard SQL the guarantee. The formal definition of differential privacy database. The framework is designed to facilitate working is written in terms of the distance d(x; y) between two directly with SQL queries, since SQL is the most com- databases, i.e. the number of entries on which they differ: monly used language for high-performance production d(x; y) = jfi : xi 6= yigj. Two databases x and y DBMSs. By using a standard SQL database engine instead are neighbors if d(x; y) = 1. A randomized mechanism of a custom runtime, CHORUS can leverage the reliability, K : Dn ! R preserves (, δ)-differential privacy if for scalability and performance of modern databases, which any pair of neighboring databases x; y 2 Dn and set S of are built on decades of research and engineering experi- possible outputs: ence. The cooperative architecture applies to all of the Pr[K(x) 2 S] ≤ e Pr[K(y) 2 S] + δ recently-developed differential privacy mechanisms—even Differential privacy can be enforced by adding noise to ones that require significant changes to the way queries the non-private results of a query. The scale of this noise execute or generate entirely new queries. We demon- depends on the sensitivity of the query. The global sensi- strate the flexibility of the approach, and of the CHORUS tivity of a query f : Dn ! R is defined as: framework, by implementing both simple mechanisms like GSf = max jf(x) − f(y)j summation with clipping and complex mechanisms like x;y:d(x;y)=1 wPINQ, MWEM, and the matrix mechanism. In all of Importantly, differential privacy mechanisms satisfy a se- these implementations, CHORUS supports scalability by quential composition property: if F satisfies ( ; δ )- moving data processing tasks to the DBMS. 1 1 1 differential privacy, and F2 satisfies (2; δ2)-differential Deployment. We have made CHORUS available as open privacy, then running both F1 and F2 satisfies (1+2; δ1+ source software [3], and it is designed for integration δ2)-differential privacy. For more on differential privacy, in production environments.