
The Structure of Optimal Private Tests for Simple Hypotheses∗

Clément L. Canonne ([email protected])
Gautam Kamath, Simons Institute for the Theory of Computing, USA ([email protected])
Audra McMillan, Boston University and Northeastern University, USA ([email protected])
Adam Smith, Boston University, USA ([email protected])
Jonathan Ullman, Northeastern University, USA ([email protected])

ABSTRACT

Hypothesis testing plays a central role in statistical inference, and is used in many settings where privacy concerns are paramount. This work answers a basic question about privately testing simple hypotheses: given two distributions P and Q, and a privacy level ε, how many i.i.d. samples are needed to distinguish P from Q subject to ε-differential privacy, and what sort of tests have optimal sample complexity? Specifically, we characterize this sample complexity up to constant factors in terms of the structure of P and Q and the privacy level ε, and show that this sample complexity is achieved by a certain randomized and clamped variant of the log-likelihood ratio test. Our result is an analogue of the classical Neyman–Pearson lemma in the setting of private hypothesis testing. We also give an application of our result to private change-point detection. Our characterization applies more generally to hypothesis tests satisfying essentially any notion of algorithmic stability, which is known to imply strong generalization bounds in adaptive data analysis, and thus our results have applications even when privacy is not a primary concern.

CCS CONCEPTS

• Mathematics of computing → Hypothesis testing and confidence interval computation; • Security and privacy; • Theory of computation → Design and analysis of algorithms;

KEYWORDS

differential privacy, hypothesis testing

ACM Reference Format:
Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam Smith, and Jonathan Ullman. 2019. The Structure of Optimal Private Tests for Simple Hypotheses. In Proceedings of the 51st Annual ACM SIGACT Symposium on the Theory of Computing (STOC '19), June 23–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3313276.3316336

∗In memory of Stephen E. Fienberg (1942–2016). A full version of this paper is available as [19].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
STOC '19, June 23–26, 2019, Phoenix, AZ, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6705-9/19/06...$15.00
https://doi.org/10.1145/3313276.3316336

1 INTRODUCTION

Hypothesis testing plays a central role in statistical inference, analogous to that of decision or promise problems in computability and complexity theory. A hypothesis testing problem is specified by two disjoint sets of probability distributions over the same set, called hypotheses, H0 and H1. An algorithm T for this problem, called a hypothesis test, is given a sample x from an unknown distribution P, with the requirement that T(x) should, with high probability, output "0" if P ∈ H0, and "1" if P ∈ H1. There is no requirement for distributions outside of H0 ∪ H1. In theoretical computer science, such problems sometimes go by the name distribution property testing.

Hypothesis testing problems are important in their own right, as they formalize yes-or-no questions about an underlying population based on a randomly drawn sample, such as whether a given factor strongly influences life expectancy, or whether a particular medical treatment is effective. Successful hypothesis tests with high degrees of confidence remain the gold standard for publication in top journals in the physical and social sciences. Hypothesis testing problems are also important in the theory of statistics and machine learning, as many lower bounds for estimation and optimization problems are obtained by reducing from hypothesis testing.

This paper aims to understand the structure and sample complexity of optimal hypothesis tests subject to strong privacy guarantees. Large collections of personal information are now ubiquitous, but their use for effective scientific discovery remains limited by concerns about privacy. In addition to the well-understood settings of data collected during scientific studies, such as clinical experiments and surveys, many other data sources where privacy concerns are paramount are now being tapped for socially beneficial analysis, such as Social Science One [70], which aims to allow access to data collected by Facebook and similar companies.

We study algorithms that satisfy differential privacy (DP) [32], a restriction on the algorithm that ensures meaningful privacy guarantees against an adversary with arbitrary side information [47]. Differential privacy has come to be the de facto standard for the analysis of private data, used as a measure of privacy for data analysis systems at Google [36], Apple [25], and the U.S. Census Bureau [24]. Differential privacy and related distributional notions of algorithmic stability can be crucial for statistical validity even

when confidentiality is not a direct concern, as they provide generalization guarantees in an adaptive setting [29].

Consider an algorithm that takes a set of data points from a set X—where each point belongs to some individual—and produces some public output. We say the algorithm is differentially private if no single data point can significantly impact the distribution on outputs. Formally, we say two data sets x, x′ ∈ Xⁿ of the same size are neighbors if they differ in at most one entry.

Definition 1.1 ([32]). A randomized algorithm T taking inputs in X∗ and returning random outputs in a space with event set S is ε-differentially private if for all n ≥ 1, for all neighboring data sets x, x′ ∈ Xⁿ, and for all events S ∈ S,
    P[T(x) ∈ S] ≤ e^ε · P[T(x′) ∈ S].

For the special case of tests returning output in {0, 1}, the output distribution is characterized by the probability of returning "1". Letting g(x) = P[T(x) = 1], we can equivalently require that
    max{ g(x)/g(x′), (1 − g(x))/(1 − g(x′)) } ≤ e^ε.

For algorithms with binary outputs, this definition is essentially equivalent to all other commonly studied notions of privacy and distributional algorithmic stability (see "Connection to Algorithmic Stability", below).

Contribution: The Sample Complexity of Private Tests for Simple Hypotheses. We focus on the setting of i.i.d. data and singleton hypotheses H0, H1, which are called simple hypotheses. The algorithm is given a sample of n points x₁, ..., xₙ drawn i.i.d. from one of two distributions, P or Q, and attempts to determine which one generated the input. That is, H0 = {Pⁿ} and H1 = {Qⁿ}. We investigate the following question.

    Given two distributions P and Q and a privacy parameter ε > 0, what is the minimum number of samples (denoted SC^{P,Q}_ε) needed for an ε-differentially private test to reliably distinguish P from Q, and what are optimal private tests?

These questions are well understood in the classical, nonprivate setting. The number of samples needed to distinguish P from Q is Θ(1/H²(P,Q)), where H² denotes the squared Hellinger distance (3).¹ Furthermore, by the Neyman–Pearson lemma, the exactly optimal test consists of computing the likelihood ratio Pⁿ(x)/Qⁿ(x) and comparing it to some threshold.

We give analogous results in the private setting. First, we give a closed-form expression that characterizes the sample complexity up to universal multiplicative constants, and highlights the range of ε in which private tests use a similar amount of data to the best nonprivate ones. We also give a specific, simple test that achieves that sample complexity. Roughly, the test makes a noisy decision based on a "clamped" log-likelihood ratio in which the influence of each data point is limited. The sample complexity has the form Θ(1/adv₁), where adv₁ is the advantage of the test over random guessing on a sample of size n = 1. The optimal test and its sample complexity are described in Theorem 1.2.

Our result provides the first instance-specific characterization of a statistical problem's complexity for differentially private algorithms. Understanding the private sample complexity of statistical problems is delicate. We know there are regimes where statistical problems can be solved privately "for free" asymptotically (e.g. [20, 32, 45, 69]) and others where there is a significant cost, even for relaxed definitions of privacy (e.g. [15, 35]), and we remain far from a general characterization of the statistical cost of privacy. Duchi, Jordan, and Wainwright [26] give a characterization for the special case of simple tests by local differentially private algorithms, a more restricted setting where samples are randomized individually, and the test makes a decision based on these randomized samples. Our characterization in the general case is more involved, as it exhibits several distinct regimes for the parameter ε.

Our analysis relies on a number of tools of independent interest: a characterization of private hypothesis testing in terms of couplings between distributions on Xⁿ, and a novel interpretation of Hellinger distance as the advantage over random guessing of a specific, randomized likelihood ratio test.

The Importance of Simple Hypotheses. Many of the hypotheses that arise in applications are not simple, but are so-called composite hypotheses. For example, deciding if two features are independent or far from it involves sets H0 and H1 each containing many distributions. Yet many of those tests can be reduced to simple ones. For example, deciding if the mean of a Gaussian is less than 0 or greater than 1 can be reduced to testing if the mean is either 0 or 1. Furthermore, simple tests arise in lower bounds for estimation—the well-known characterization of parametric estimation in terms of Fisher information is obtained by showing that the Fisher information measures variability in the Hellinger distance and then employing the Hellinger-based characterization of nonprivate simple tests (e.g. [12, Chap. II.31.2, p.180]). Our characterization of private testing implies similar lower bounds for estimation (along the lines of lower bounds of Duchi and Ruan [27] in the local model of differential privacy).

Connection to Algorithmic Stability. For hypothesis tests with constant error probabilities, sample complexity bounds for differential privacy are equivalent, up to constant factors, to sample complexity bounds for other notions of distributional algorithmic stability, such as (ε, δ)-DP [31], concentrated DP [14, 34], and KL- and TV-stability [9, 77] (see [3, Lemma 5]). (Briefly: if we ensure that Pr(T(x) = 1) ∈ [0.01, 0.99] for all x, then an additive change of ε corresponds to a multiplicative change of 1 ± O(ε), and vice versa.) Consequently, our results imply optimal tests for use in conjunction with stability-based generalization bounds for adaptive data analysis, which has generated significant interest in recent years [9, 28–30, 37, 38, 64, 65, 78].

1.1 Hypothesis Testing

To put our result in context, we review classical results about nonprivate hypothesis testing. Let P and Q be two probability distributions over an arbitrary domain X. A hypothesis test K : X∗ → {"P", "Q"} is an algorithm that takes a set of samples x ∈ X∗ and attempts to determine if it was drawn from P or Q. Define the advantage of a test K given n samples as
    adv_n(K) = P_{x∼Pⁿ}[K(x) = "P"] − P_{x∼Qⁿ}[K(x) = "P"].    (1)

¹This statement is folklore, but see, e.g., [8] for the lower bound, [18] or Corollary 2.2 for the upper bound.
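The advantage (1) is easy to estimate by Monte Carlo for any concrete test. The following Python sketch is purely illustrative (it is not from the paper; the Bernoulli pair and the majority-threshold test are placeholder choices we introduce here):

```python
import numpy as np

rng = np.random.default_rng(0)

def advantage(test, sample_P, sample_Q, n, trials=20000):
    # Monte Carlo estimate of adv_n(K) = Pr_{x~P^n}[K(x)="P"] - Pr_{x~Q^n}[K(x)="P"].
    hits_P = sum(test(sample_P(n)) == "P" for _ in range(trials))
    hits_Q = sum(test(sample_Q(n)) == "P" for _ in range(trials))
    return (hits_P - hits_Q) / trials

# Placeholder instance: distinguishing Ber(0.7) from Ber(0.3) with the
# majority test based on the count statistic S(x) = sum_i x_i.
sample_P = lambda n: (rng.random(n) < 0.7).astype(int)
sample_Q = lambda n: (rng.random(n) < 0.3).astype(int)
majority = lambda x: "P" if x.sum() >= len(x) / 2 else "Q"

print(advantage(majority, sample_P, sample_Q, n=10))
```

As n grows, the estimated advantage of any reasonable test approaches 1, which is exactly the behavior that the sample-complexity definition in the next section makes quantitative.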


We say that K distinguishes P from Q with sample complexity SC^{P,Q}(K) if for every n ≥ SC^{P,Q}(K), adv_n(K) ≥ 2/3. We say SC^{P,Q} = min_K SC^{P,Q}(K) is the sample complexity of distinguishing P from Q.

Most hypothesis tests are based on some real-valued test statistic S : X∗ → R, where
    K_S(x) = "P" if S(x) ≥ κ, and "Q" otherwise,
for some threshold κ. We will sometimes abuse notation and use the test statistic S and the implied hypothesis test K_S interchangeably.

The classical Neyman–Pearson lemma says that the exactly optimal test² for distinguishing P, Q is the log-likelihood ratio test given by the test statistic
    LLR(x₁, ..., xₙ) = Σᵢ₌₁ⁿ log( P(xᵢ)/Q(xᵢ) ).    (2)

Another classical result says that the optimal sample complexity is characterized by the squared Hellinger distance between P and Q, which is defined as
    H²(P,Q) = (1/2) ∫_X ( √P(x) − √Q(x) )² dx.    (3)

Specifically, SC^{P,Q} = SC^{P,Q}(LLR) = Θ(1/H²(P,Q)). Note that the same metric provides upper and lower bounds on the sample complexity.

1.2 Our Results

Our main result is an approximate characterization of the sample complexity of ε-differentially private tests for distinguishing P and Q. Analogous to the non-private case, we will write SC^{P,Q}_ε = min_{ε-DP K} SC^{P,Q}(K) to denote the sample complexity of ε-differentially privately (ε-DP) distinguishing P from Q, and we characterize this quantity up to constant factors in terms of the structure of P, Q and the privacy parameter ε. Specifically, we show that a privatized clamped log-likelihood ratio test is optimal up to constant factors. Privacy may be achieved through either the Laplace or the exponential mechanism, and we will prove optimality of both methods.

For b ≥ a, we define the clamped log-likelihood ratio statistic
    cLLR_{a,b}(x) = Σᵢ [ log( P(xᵢ)/Q(xᵢ) ) ]ₐᵇ,
where [·]ₐᵇ denotes the projection onto the interval [a, b] (that is, [z]ₐᵇ = max(a, min(z, b))).

Define the soft clamped log-likelihood test:
    scLLR_{a,b}(x) = "P" with probability ∝ exp( (1/2)·cLLR_{a,b}(x) ), and "Q" with probability ∝ 1.
The test scLLR is an instance of the exponential mechanism [54], and thus scLLR_{a,b} satisfies ε-differential privacy for ε = (b − a)/2.

Similarly, define the noisy clamped log-likelihood ratio test:
    ncLLR_{a,b}(x) = "P" if cLLR_{a,b}(x) + Lap( (b − a)/ε ) > 0, and "Q" otherwise.
The test ncLLR is an instance of postprocessing the Laplace mechanism [32], and satisfies ε-differential privacy.

Our main result is that, for every P, Q, and every ε, the tests scLLR_{−ε′,ε} and ncLLR_{−ε′,ε} are optimal up to constant factors, for some appropriate 0 ≤ ε′ ≤ ε. To state the result more precisely, we introduce some additional notation. First define
    τ = τ(P,Q) ≜ max{ ∫_X max{P(x) − e^ε·Q(x), 0} dx , ∫_X max{Q(x) − e^ε·P(x), 0} dx },    (4)
and assume without loss of generality, for the remainder of this work, that τ = ∫_X max{P(x) − e^ε·Q(x), 0} dx.³

Next, let 0 ≤ ε′ ≤ ε be the largest value such that
    ∫_X max{P(x) − e^ε·Q(x), 0} dx = ∫_X max{Q(x) − e^{ε′}·P(x), 0} dx = τ,
whose existence is guaranteed by a continuity argument (a formal argument is in the full version). We give an illustration of the definition of τ and ε′ in Figure 1.

Finally, define P̃ = min{e^ε·Q, P} and Q̃ = min{e^{ε′}·P, Q} and normalize by (1 − τ) to obtain distributions
    P′ = P̃/(1 − τ) and Q′ = Q̃/(1 − τ).    (5)
The distributions P′, Q′ are such that
    −ε′ ≤ log( P′(x)/Q′(x) ) ≤ ε,
and
    P = (1 − τ)P′ + τP″ and Q = (1 − τ)Q′ + τQ″,
where P″ and Q″ are distributions with disjoint support. The quantity τ is the smallest possible number for which such a representation is possible. With these definitions in hand, we can now state our main result.

Theorem 1.2. For every pair of distributions P, Q, and every ε > 0, the optimal sample complexity for ε-differentially private tests is achieved by either the soft or noisy clamped log-likelihood test, and satisfies
    SC^{P,Q}_ε = Θ( SC^{P,Q}(ncLLR_{−ε′,ε}) ) = Θ( SC^{P,Q}(scLLR_{−ε′,ε}) )
              = Θ( 1 / ( ε·τ(P,Q) + (1 − τ)·H²(P′,Q′) ) ) = Θ( 1 / adv₁(scLLR_{−ε′,ε}) ).

When ε ≥ max_x |log P(x)/Q(x)|, Theorem 1.2 reduces to SC^{P,Q}_ε = Θ(1/H²(P,Q)), which is the sample complexity for distinguishing between P and Q in the non-private setting. This implies that we get privacy for free asymptotically in this parameter regime. We will

²More precisely, given any test K, there is a setting of the threshold κ for the log-likelihood ratio test that weakly dominates K, meaning that P_{x∼Q}[LLR(x) = "P"] ≤ P_{x∼Q}[K(x) = "P"] and P_{x∼P}[LLR(x) = "Q"] ≤ P_{x∼P}[K(x) = "Q"] (keeping the true positive rates P_{x∼P}[K(x) = "P"], P_{x∼Q}[K(x) = "Q"] fixed). One may need to randomize the decision when S(x) = κ to achieve some tradeoffs between false negative and false positive rates.

³For α ≥ 0, the quantity D_α(P∥Q) = ∫ max(P(x) − α·Q(x), 0) dx is an f-divergence and has appeared in the literature before under the names α-divergence, hockey-stick divergence, or elementary divergence [6, 52] (for α = 1, one obtains the usual total variation distance). Thus, τ is the maximum of the divergences D_{e^ε}(P∥Q) and D_{e^ε}(Q∥P). It can also be described as the smallest value δ such that P and Q are (ε, δ)-indistinguishable [33].
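For distributions over a finite domain, the quantities above can be computed directly. The sketch below is our own illustration rather than code from the paper: it implements cLLR, ncLLR, scLLR, and a discrete version of τ from (4) in Python. The Laplace scale (b − a)/ε reflects that changing one sample moves cLLR_{a,b} by at most b − a; the three-point densities are the example from Section 1.3 with an arbitrarily chosen α.

```python
import numpy as np

rng = np.random.default_rng(0)

def cllr(x, P, Q, a, b):
    # Clamped log-likelihood ratio: sum_i of log(P(x_i)/Q(x_i)) clipped to [a, b].
    with np.errstate(divide="ignore"):
        llr = np.log(P[x]) - np.log(Q[x])  # P[x_i] = 0 gives -inf, clipped to a
    return np.clip(llr, a, b).sum()

def ncllr(x, P, Q, a, b, eps):
    # Noisy clamped LLR: one sample changes cLLR by at most b - a, so Laplace
    # noise of scale (b - a)/eps privatizes the statistic; thresholding at 0
    # is postprocessing.
    return "P" if cllr(x, P, Q, a, b) + rng.laplace(scale=(b - a) / eps) > 0 else "Q"

def scllr(x, P, Q, a, b):
    # Soft clamped LLR: "P" with probability proportional to exp(cLLR/2) and
    # "Q" with probability proportional to 1 (exponential mechanism).
    s = cllr(x, P, Q, a, b)
    return "P" if rng.random() < 1 / (1 + np.exp(-s / 2)) else "Q"

def tau(P, Q, eps):
    # The larger of the two hockey-stick divergences in (4), finite domain.
    d_pq = np.maximum(P - np.exp(eps) * Q, 0).sum()
    d_qp = np.maximum(Q - np.exp(eps) * P, 0).sum()
    return max(d_pq, d_qp)

# Three-point densities from Section 1.3 (alpha is a placeholder value).
alpha, eps = 0.09, 0.1
P = np.array([0.0, 0.5, 0.5])
Q = np.array([2 * alpha**1.5, 0.5 + alpha - alpha**1.5, 0.5 - alpha - alpha**1.5])
x = rng.choice(3, size=200, p=Q)  # a sample drawn from Q
print(tau(P, Q, eps), ncllr(x, P, Q, -eps, eps, eps))
```

Note that clamping to [−ε′, ε] both bounds each sample's influence (giving privacy) and keeps enough of the likelihood information to preserve the advantage, which is the content of Theorem 1.2.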


focus on proving the first equality; a proof of the second appears in the full version.

[Figure 1 omitted: a plot of two densities P, Q on [0, 1] together with the curves e^ε·P(x) and e^ε·Q(x).]
Figure 1: An illustration of the definition of τ, for ε = 0.2 and for two densities P, Q over X = [0, 1]. The blue shaded area represents ∫ max{P(x) − e^ε·Q(x), 0} dx, while the red corresponds to ∫ max{Q(x) − e^ε·P(x), 0} dx. The larger of these two is τ(P,Q). If the blue area is larger than the red area, the definition of ε′ corresponds to lowering the dotted blue curve until the two are the same size.

Comparison to Known Bounds. For ε < 1, the bounds
    Ω( 1/H²(P,Q) ) ≤ SC^{P,Q}_ε ≤ O( 1/(ε·H²(P,Q)) )
follow directly from the non-private sample complexity. Namely, the lower bound is the non-private sample complexity, and the upper bound is obtained by applying the sample-and-aggregate technique [59] to the optimal non-private test. They can be recovered from Theorem 1.2 by noting that
    ε·H²(P,Q) = (ε/2)·∥√P − √Q∥₂²
              = O( ε·∥√P − √P̃∥₂² + ε·∥√P̃ − √Q̃∥₂² + ε·∥√Q̃ − √Q∥₂² )
              = O( ετ + ε(1 − τ)·H²(P′,Q′) + ετ )
              = O( ετ + (1 − τ)·H²(P′,Q′) )
and
    ετ + (1 − τ)·H²(P′,Q′) = ε·∫_S |P(x) − e^ε·Q(x)| dx + (1/2)·∥√P̃ − √Q̃∥₂²
                           ≤ ε·∫_S |P(x) − Q(x)| dx + (1/2)·∥√P − √Q∥₂²
                           ≤ ε·( (1 + e^{ε/2}) / (e^{ε/2} − 1) )·∫_S ( √P(x) − √Q(x) )² dx + H²(P,Q)
                           = O( H²(P,Q) ),
where S = { x : P(x) − e^ε·Q(x) > 0 }.

1.2.1 Application: Private Change-Point Detection. As an application of our result, we obtain optimal private algorithms for change-point detection. Given distributions P and Q, an algorithm solving offline change-point detection for P and Q takes a stream x = (x₁, x₂, ..., xₙ) ∈ Xⁿ with the guarantee that there is an index k∗ such that the first k∗ elements are sampled i.i.d. from P and the latter elements are sampled i.i.d. from Q, and attempts to output k̂ ≈ k∗. We can also consider an online variant where elements xᵢ arrive one at a time.

Change-point detection has a long history in statistics and information theory (e.g. [50, 51, 53, 55, 56, 60–63, 67, 68, 73]). Cummings et al. [22] recently gave the first private algorithms for change-point detection. Their algorithms are based on a private version of the log-likelihood ratio, and in cases where the log-likelihood ratio is not strictly bounded, they relax to a weaker distributional variant of differential privacy. Using Theorem 1.2, we can achieve the standard worst-case notion of differential privacy, and achieve optimal error bounds for every P, Q.

Theorem 1.3 (Informal). For every pair of distributions P and Q, and every ε > 0, there is an ε-differentially private algorithm that solves offline change-point detection for P and Q such that, with probability at least 9/10, |k̂ − k∗| = O(SC^{P,Q}_ε).

The expected error in this result is optimal up to constant factors for every pair P, Q, as one can easily show that the error must be at least Ω(SC^{P,Q}_ε). Theorem 1.3 can be extended to give an arbitrarily small probability β of failure, and can be extended to the online change-point detection problem, although with more complex accuracy guarantees. Our algorithm introduces a general reduction from private change-point detection for families of distributions H0 and H1 to private hypothesis testing for the same family, which we believe to be of independent interest.

1.3 Techniques

First Attempts. A folklore result based on the sample-and-aggregate paradigm [59] shows that for every P, Q and every ε > 0, SC^{P,Q}_ε = O( (1/ε)·SC^{P,Q} ), meaning privacy comes at a cost of at most O(1/ε).⁴ However, there are many examples where SC^{P,Q}_ε = O(SC^{P,Q}) even when ε = o(1), and understanding this phenomenon is crucial.

A few illustrative pairs of distributions P, Q will serve to demonstrate the difficulties that go into formulating and proving Theorem 1.2. First, consider the domain X = {0, 1} of size two (i.e. Bernoulli distributions). To distinguish P = Ber((1 + α)/2) from Q = Ber((1 − α)/2), the optimal non-private test statistic is S(x) = Σᵢ xᵢ, which requires Θ(1/α²) = Θ(1/H²(P,Q)) samples.

To make the test differentially private, one can use a soft version of the test that outputs "P" with probability proportional to exp(ε·Σᵢ xᵢ) and "Q" with probability proportional to exp(ε·Σᵢ (1 − xᵢ)). This private test has sample complexity Θ(1/α² + 1/(αε)), which is optimal. In contrast, if P = Ber(0) and Q = Ber(α), then the non-private sample complexity becomes Θ(1/α), and the optimal private sample complexity becomes Θ(1/(αε)). In general, for X = {0, 1}, one

⁴See, e.g., [16] for a proof.


can show that
    SC^{P,Q}_ε = Θ( 1/H²(P,Q) + 1/(ε·TV(P,Q)) ).    (6)

The sample complexity of testing Bernoulli distributions (6) already demonstrates an important phenomenon in private hypothesis testing—for many distributions P, Q, there is a "phase transition" where the sample complexity takes one form when ε is sufficiently large and another when ε is sufficiently small, and often the sample complexity in the "large ε regime" is equal to the non-private complexity up to lower order terms. A key challenge in obtaining Theorem 1.2 is to understand these transitions, and to understand the sample complexity in each regime.

Since each of the terms in (6) is a straightforward lower bound on the sample complexity of private testing, one might conjecture that (6) holds for every pair of distributions. However, our next illustrative example shows that this conjecture is false even for the domain X = {0, 1, 2}. Consider the two densities
    P = (0, 0.5, 0.5) and Q = (2α^{3/2}, 0.5 + α − α^{3/2}, 0.5 − α − α^{3/2}).

For these distributions, the log-likelihood ratio statistic is roughly equivalent to the statistic that counts the number of occurrences of "0," S_{0}(x) = Σᵢ 1{xᵢ = 0}, and has sample complexity Θ(1/α^{3/2}). For this pair of distributions, the optimal private test depends on the relationship between α and ε. One such test is to simply use the soft version of S_{0}, and the other is to use the soft version of the test S_{0,1}(x) = Σᵢ 1{xᵢ ∈ {0, 1}} that counts occurrences of the first two elements. One can prove that the better of these two tests is optimal up to constant factors, giving
    SC^{P,Q}_ε = Θ( min{ 1/(α^{3/2}·ε), 1/α² } + 1/(αε) ).

For these distributions, (6) reduces to Θ(1/α^{3/2} + 1/(αε)), so these distributions show that the optimal sample complexity can be much larger than (6). Moreover, these distributions exhibit that the optimal sample complexity can vary with ε in complex ways, making several transitions and never matching the non-private complexity unless α or ε is constant.

Key Ingredients. The second example above demonstrates that the optimal test itself can vary with ε in an intricate fashion, which makes it difficult to construct a single test for which we can prove matching upper and lower bounds on the sample complexity. The use of the clamped log-likelihood test arose out of an attempt to find a single test that is optimal for the second pair of distributions P and Q, and relies on a few crucial technical ingredients.

First, our upper bound on sample complexity relies on a key observation that the Hellinger distance between two distributions is exactly the advantage, adv₁, of a soft log-likelihood ratio test sLLR on a sample of size 1. The sLLR test is a randomized test that outputs "P" with probability proportional to √(Pⁿ(x)/Qⁿ(x)) = e^{LLR(x)/2}, where LLR(x) = Σᵢ log(P(xᵢ)/Q(xᵢ)), and "Q" with probability proportional to 1. This characterization of H²(P,Q) as the advantage of the sLLR tester may be of independent interest.

This observation is crucial for our work, because it implies that sLLR is ε-DP if sup_{x∈X} |log P(x)/Q(x)| ≤ ε. That is, in the case that the log-likelihood ratio is bounded by ε, we get ε-DP for free (since Hellinger distance, and thus sLLR, characterizes the optimal asymptotic sample complexity). Thus, we use the clamped log-likelihood ratio test, which forces the log-likelihood ratio to be bounded. Our lower bound in a sense shows that any loss of power in the test due to clamping is necessary for differentially private tests.

The lower bound proof proceeds by finding a coupling ρ of Pⁿ and Qⁿ with low expected Hamming distance E_{(X,Y)∼ρ}[d_H(X,Y)], which in turn implies that the sample complexity is large (see e.g. [3]). The coupling we use essentially splits the support of P and Q into two subsets: those elements with log(P(x)/Q(x)) ≥ ε, and the remaining elements. To construct the coupling, given a sample X ∼ Pⁿ, the high-ratio elements, for which log(P(x)/Q(x)) ≥ ε, are resampled with probability related to the ratio. The contribution of this step to the Hamming distance gives us the ετ part of the lower bound. The data set consisting of the m ≤ n elements that have not yet been resampled is then coupled using the total variation distance coupling between Pᵐ and Qᵐ. Therefore, with probability 1 − TV(Pᵐ, Qᵐ), this part of the data set remains unchanged. This part of the coupling results in the H²(P′,Q′) part of the lower bound.

The proof of the upper bound also splits into two parts, roughly corresponding to the same aspects of the distributions P and Q as above. That is, we view our tester as either counting the number of high-ratio elements or computing the log-likelihood ratio on low-ratio elements. A useful observation is that this duality between the upper and lower bounds is inevitable. In Section 3, we characterize the advantage of the optimal tester in terms of the Wasserstein distance between Pⁿ and Qⁿ with metric min{ε·d_H(X,Y), 1}. That is, the advantage of the optimal tester must be matched by some coupling of Pⁿ and Qⁿ.

1.4 Related Work

Early work on differentially private hypothesis testing began in the statistics community with [72, 74]. More recently, there has been a significant number of works on differentially private hypothesis testing. One line of work [17, 21, 40, 44, 49, 71, 76] designs differentially private versions of popular test statistics for testing goodness-of-fit, closeness, and independence, as well as private ANOVA, focusing on the performance at small sample sizes. Work by Wang et al. [75] focuses on generating statistical approximating distributions for differentially private statistics, which they apply to hypothesis testing problems. A recent work by Awan and Slavkovic [5] gives a universally optimal test when the domain size is two; however, Brenner and Nissim [13] show that such universally optimal tests cannot exist when the domain has more than two elements. A complementary research direction, initiated by Cai et al. [16], studies the minimax sample complexity of private hypothesis testing. [3] and [4] have given worst-case nearly optimal algorithms for goodness-of-fit and closeness testing of arbitrary discrete distributions. That is, there exists some worst-case distribution P such that their algorithm has optimal sample complexity for testing goodness-of-fit to P. Recently, [2] designed nearly optimal algorithms for estimating properties like support size and entropy. Another related area [1, 41, 66] studies hypothesis testing in the local model of differential privacy. In particular, Duchi, Jordan, and Wainwright [26] proved an analogue of our result for the restricted

314 STOC '19, June 23–26, 2019, Phoenix, AZ, USA — Canonne, Kamath, McMillan, Smith, Ullman

case of locally differentially private algorithms. Their characterization shows that the optimal sample complexity for ε-DP local algorithms is Θ(1/(ε^2 TV(P,Q)^2)). This characterization does not exhibit the same phenomena that we demonstrate in the central model—privacy never comes "for free" if ε = o(1), and the sample complexity does not exhibit different regimes depending on ε. More generally, local-model tests are considerably simpler, and simpler to reason about, than central-model tests.

There are also several rich lines of work attempting to give tight instance-specific characterizations of the sample complexities of various differentially private computations, most notably linear query release [11, 43, 48, 57, 58] and PAC and agnostic learning [10, 39, 46]. The problems considered in these works are arguably more complex than the hypothesis testing problems we consider here; the characterizations are considerably looser, and are only optimal up to polynomial factors.

There has been a recent line of work [9, 23, 28–30, 37, 38, 64, 65, 78] on adaptive data analysis, in which the same dataset is used repeatedly across multiple statistical analyses, where the choice of each analysis depends on the outcomes of previous analyses. The key theme in these works is to show that various strong notions of algorithmic stability, including differential privacy, imply generalization bounds in the adaptive setting. Our characterization applies to all notions of stability considered in these works.

As an application of our private hypothesis testing results, we provide algorithms for private change-point detection. As discussed in Section 1.2.1, change-point detection has enjoyed a significant amount of study in information theory and statistics. Our results are in the private minimax setting, as recently introduced by Cummings et al. [22]. We improve on their results by improving the detection accuracy and providing strong privacy guarantees for all pairs of hypotheses.

2 UPPER BOUND ON SAMPLE COMPLEXITY OF PRIVATE TESTING

In this section, we establish the upper bound part of Theorem 1.2, establishing that the soft clamped log-likelihood ratio test scLLR and the noisy clamped log-likelihood ratio test ncLLR achieve the stated sample complexity. In more detail, we begin in Section 2.1 by characterizing the Hellinger distance between two distributions as the advantage, adv_1, of a specific randomized test, sLLR. Besides being of independent interest, this reformulation also yields some insight on its privatized variant, scLLR. In Section 2.2, we introduce the noisy clamped log-likelihood ratio test (ncLLR), and show that its sample complexity is at least that of scLLR. We then proceed in Section 2.3 to upper-bound the sample complexity of ncLLR, which also implies the same bound on that of scLLR.

2.1 Hellinger Distance Characterizes the Soft Log-Likelihood Test

Recall that the advantage of a test, adv_n, was defined in Equation (1), and that the squared Hellinger distance, H^2(P,Q), between two distributions P and Q is defined as

    H^2(P,Q) = (1/2) ∫_X (√P(x) − √Q(x))^2 dx.

It has long been known that the Hellinger distance characterizes the asymptotic sample complexity of non-private testing (see, e.g., [12]). In this section we show that the Hellinger distance exactly characterizes the advantage of the following randomized test given a single data point:

    sLLR(x) = P with probability g(x),
              Q with probability 1 − g(x),

where

    g(x) = exp((1/2) log(P(x)/Q(x))) / (1 + exp((1/2) log(P(x)/Q(x)))) ∈ [0, 1].

Considering the advantage of sLLR might seem puzzling at first glance, since the classic likelihood ratio test LLR enjoys a better advantage. More specifically, the value of adv_1 for these two tests is H^2(P,Q) (Theorem 2.1) and TV(P,Q) (by the definition of total variation distance), respectively, and H^2(P,Q) ≤ TV(P,Q) (see, e.g., [42]), so it would appear that LLR is the better test. There are two relevant features of sLLR which will be useful. First, as mentioned before, sLLR is naturally private if the likelihood ratio is bounded. Second, a tensorization property of Hellinger distance allows us to easily relate the advantage of the n-sample test to the advantage of the 1-sample test.

Theorem 2.1. For any two distributions P, Q, the advantage, adv_1, of sLLR is H^2(P,Q).

Proof. Note that we can rewrite

    H^2(P,Q) = (1/2) ∫_X (√P(x) − √Q(x))^2 dx
             = (1/2) ∫_X (P(x) − Q(x)) · (√P(x) − √Q(x)) / (√P(x) + √Q(x)) dx
             = (1/2) ∫_X (P(x) − Q(x)) · (√(P(x)/Q(x)) − 1) / (√(P(x)/Q(x)) + 1) dx.

Now,

    g(x) = √(P(x)/Q(x)) / (√(P(x)/Q(x)) + 1) = 1/2 + (1/2) · (√(P(x)/Q(x)) − 1) / (√(P(x)/Q(x)) + 1),

and therefore

    H^2(P,Q) = ∫_X (P(x) − Q(x)) g(x) dx = E_{x∼P}[g(x)] − E_{x∼Q}[g(x)] = Pr_{x∼P}[sLLR(x) = P] − Pr_{x∼Q}[sLLR(x) = P].

Thus, the advantage of sLLR is H^2(P,Q), as claimed. □

This tells us the advantage of the test which takes only one sample. As a corollary, we can derive the sample complexity of distinguishing P and Q using sLLR.

Corollary 2.2. SC^{P,Q}(sLLR) = O(1/H^2(P,Q)).

Proof. Our analysis is similar to that of [18]. Observe that the test sLLR which gets n samples from either P or Q is equivalent to the test sLLR which gets 1 sample from either P^n or Q^n. By Theorem 2.1, we have that the advantage of either (equivalent) test is adv = H^2(P^n, Q^n). We will require the following tensorization property of the squared Hellinger distance:

    H^2(P_1 × · · · × P_n, Q_1 × · · · × Q_n) = 1 − ∏_{i=1}^n (1 − H^2(P_i, Q_i)).

With this in hand,

    adv = H^2(P^n, Q^n) = 1 − (1 − H^2(P,Q))^n = 1 − exp(n log(1 − H^2(P,Q))) ≥ 1 − exp(−n H^2(P,Q)).

Setting n = Ω(1/H^2(P,Q)), we get adv ≥ 2/3, as desired. □
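Theorem 2.1 is easy to check by simulation. A minimal sketch (assuming finite support; the specific distributions and sample sizes below are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete example over {0, 1, 2}; any P, Q would do.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

def hellinger_sq(p, q):
    # H^2(P,Q) = (1/2) * sum_x (sqrt(P(x)) - sqrt(Q(x)))^2
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# g(x) = sqrt(P(x)/Q(x)) / (1 + sqrt(P(x)/Q(x))), as in the proof of Theorem 2.1.
g = np.sqrt(P / Q) / (1 + np.sqrt(P / Q))

# Empirical advantage: Pr_{x~P}[sLLR(x) = P] - Pr_{x~Q}[sLLR(x) = P].
N = 200_000
xP = rng.choice(3, size=N, p=P)
xQ = rng.choice(3, size=N, p=Q)
outP = rng.random(N) < g[xP]   # sLLR outputs "P" with probability g(x)
outQ = rng.random(N) < g[xQ]
adv = outP.mean() - outQ.mean()
print(adv, hellinger_sq(P, Q))  # agree up to sampling error
```

For this pair the empirical advantage concentrates around H^2(P,Q) ≈ 0.068, not around TV(P,Q) = 0.3, exactly as the theorem predicts for the soft test.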

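The tensorization identity invoked in Corollary 2.2 can also be sanity-checked by brute force on a small product space (a sketch; n = 3 and the support size 3 are arbitrary illustrative choices):

```python
import numpy as np
from itertools import product

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

def hellinger_sq(p, q):
    # Squared Hellinger distance between two discrete distributions.
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# Enumerate the product distributions P^n and Q^n over all 3^n outcomes.
n = 3
Pn = np.array([np.prod(P[list(w)]) for w in product(range(3), repeat=n)])
Qn = np.array([np.prod(Q[list(w)]) for w in product(range(3), repeat=n)])

lhs = hellinger_sq(Pn, Qn)               # H^2(P^n, Q^n), computed directly
rhs = 1 - (1 - hellinger_sq(P, Q)) ** n  # 1 - (1 - H^2(P,Q))^n
print(lhs, rhs)
```

The two quantities coincide up to floating-point error, which is the identity H^2(P^n,Q^n) = 1 − (1 − H^2(P,Q))^n used to pass from the 1-sample advantage to the n-sample advantage.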

2.2 The Noisy Log-Likelihood Ratio Test

We now consider the noisy log-likelihood ratio test, which, similar to scLLR_{−ε′,ε}, is also ε-differentially private:

    ncLLR_{−ε′,ε}(X) = Σ_i [log(P(x_i)/Q(x_i))]_{−ε′}^{ε} + Lap(2),

where [·]_{−ε′}^{ε} denotes clamping to the interval [−ε′, ε]. Here Lap(2) denotes a Laplace random variable, which has density proportional to exp(−|z|/2). Readers familiar with differential privacy will note that scLLR corresponds to the exponential mechanism, while ncLLR corresponds to the report noisy max mechanism [33], and thus the two should behave quite similarly. In particular, we have the following lemma.

Lemma 2.3. For any P and Q:
(1) SC^{P,Q}(ncLLR_{−ε′,ε}) = Ω(SC^{P,Q}(scLLR_{−ε′,ε})).
(2) Furthermore, if −ε′ ≤ log(P/Q) ≤ ε then SC^{P,Q}(ncLLR_{−ε′,ε}) = Θ(SC^{P,Q}(scLLR_{−ε′,ε})).

Proof. Recall that

    scLLR_{−ε′,ε}(X) = P with probability g_ε(X), Q with probability 1 − g_ε(X),

where

    g_ε(X) = e^{(1/2) cLLR_{−ε′,ε}(X)} / (1 + e^{(1/2) cLLR_{−ε′,ε}(X)}).

If we let the threshold κ = 0, then the test based on the test statistic ncLLR_{−ε′,ε} is

    ncLLR_{−ε′,ε}(X) = P with probability h_ε(X), Q with probability 1 − h_ε(X),

where

    h_ε(X) = 1 − (1/2) e^{−(1/2) cLLR_{−ε′,ε}(X)}   if cLLR_{−ε′,ε}(X) > 0,
             (1/2) e^{(1/2) cLLR_{−ε′,ε}(X)}         if cLLR_{−ε′,ε}(X) < 0.

Now, we will use the following two inequalities: if x < 0, then

    (1/2) e^{x/2} ≤ e^{x/2} / (1 + e^{x/2}) ≤ 2 · (1/2) e^{x/2},

and if x > 0,

    (8/9) (1 − (1/2) e^{−x/2}) ≤ e^{x/2} / (1 + e^{x/2}) ≤ 1 − (1/2) e^{−x/2}.

Therefore, noting that Pr_{X∼P^n}[scLLR_{−ε′,ε}(X) = P] = E_{P^n}[g_ε(X)] and Pr_{X∼P^n}[ncLLR_{−ε′,ε}(X) = P] = E_{P^n}[h_ε(X)] are the probabilities of success, we have

    (8/9) E_{P^n}[h_ε(X)] ≤ E_{P^n}[g_ε(X)] ≤ 2 E_{P^n}[h_ε(X)]   and   (8/9) E_{Q^n}[h_ε(X)] ≤ E_{Q^n}[g_ε(X)] ≤ 2 E_{Q^n}[h_ε(X)].

Therefore, if ncLLR has a probability of success of 5/6 then scLLR has a probability of success of 2/3. This implies that SC^{P,Q}(ncLLR) ≥ SC^{P,Q}(scLLR).

If log(P/Q) ∈ [−ε′, ε] then cLLR_{−ε′,ε}(X) = LLR(X), so we have P^n(X) > Q^n(X) iff LLR(X) > 0 iff g_ε(X) ≤ h_ε(X), and therefore

    E_{P^n}[g_ε] − E_{Q^n}[g_ε] = ∫ (P^n(X) − Q^n(X)) g_ε(X) dX ≤ ∫ (P^n(X) − Q^n(X)) h_ε(X) dX = E_{P^n}[h_ε] − E_{Q^n}[h_ε],

which completes the proof. □

Corollary 2.4. If ncLLR has asymptotically optimal sample complexity then scLLR has asymptotically optimal sample complexity.

2.3 The Sample Complexity of ncLLR

In this section we prove the upper bound in Theorem 1.2 for the case where TV(P,Q) < 1 (i.e., the supports of P and Q have non-empty intersection). This assumption ensures that P̃, Q̃ ≠ 0, so that P′, Q′ are well defined. A proof of the case where TV(P,Q) = 1 is in the full version. In order to prove the upper bound in Theorem 1.2, we restate it as follows.

Theorem 2.5. The ncLLR_{−ε′,ε} test is ε-DP and

    SC^{P,Q}(ncLLR_{−ε′,ε}) ≤ O(1/(ετ + (1−τ) H^2(P′,Q′))) = O(min{1/(ετ), 1/((1−τ) H^2(P′,Q′))}).

Theorem 2.5 combined with a matching lower bound (given later in Theorem 3.4) implies that ncLLR_{−ε′,ε} has asymptotically optimal sample complexity. Thus, by Corollary 2.4, scLLR_{−ε′,ε} has asymptotically optimal sample complexity.

Before proving the bound, we pause to provide some intuition for its form. As discussed in the introduction, we can write P and Q as mixtures P = (1−τ)P′ + τP″ and Q = (1−τ)Q′ + τQ″ where P″, Q″ have disjoint support. Now consider a thought experiment, in which the test that must distinguish P from Q using a sample of size n is given, along with the sample x, a list of binary labels b_1, b_2, …, b_n that indicate for each record whether it was sampled from the first component of the mixture (either P′ or Q′), or the second component (either P″ or Q″). Of course this can only be a thought experiment—these labels are not available to a real test.

Because the mixture weights are the same for both P and Q, the number of labels of each type would be distributed the same under P and under Q, and so the tester would be faced with two independent testing problems: distinguishing P″ from Q″ using a sample of size about τn, and distinguishing P′ from Q′ using a sample of size about (1−τ)n. It would suffice for the tester to solve either of these problems.

Theorem 2.5 shows that the real tests (scLLR and ncLLR) do as well as the hypothetical tester that has access to the labels. The two arguments to the minimum in the theorem statement correspond directly to the ε-DP sample complexity of distinguishing P″ from Q″ (which requires nτ ≥ 1/ε) or distinguishing P′ from Q′ (which requires n(1−τ) ≥ 1/H^2(P′,Q′)). The proof proceeds by breaking the clamped log-likelihood ratio into two pieces, each corresponding to one of the two mixture components (again, this decomposition is not known to the algorithm). These two pieces correspond to the test statistics of the optimal testers for the two separate sub-problems in the thought experiment. We show that the test does well at distinguishing P from Q as long as either of these pieces is sufficiently informative.

On a more mechanical level, our proof of Theorem 2.5 bounds the expectation and standard deviation of the two pieces of the test statistic. We use the following simple lemma, which states that a test statistic S performs well if the distributions of the test statistic on P and Q, S(P) and S(Q), do not overlap too much. The proof is a simple application of Chebyshev's inequality.

Lemma 2.6. Given a function f : X → R, a constant c > 0 and n > 0, if the test statistic S(X) = Σ_i f(x_i) satisfies

    max{√(Var_{P^n}[S(X)]), √(Var_{Q^n}[S(X)])} ≤ c |E_{P^n}[S(X)] − E_{Q^n}[S(X)]|,

then S can be used to distinguish between P and Q with probability of success 2/3 and sample complexity at most n′ = 12 c^2 n.
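The ncLLR test can be sketched by simulation. In the sketch below the discrete P, Q and the clamp levels ε = ε′ = 0.3 are illustrative assumptions, not values from the paper; note that changing one record moves the clamped sum by at most ε + ε′, which is why Laplace noise of scale 2 suffices for ε-DP when ε′ ≤ ε:

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
eps, eps_prime = 0.3, 0.3

def ncLLR(sample):
    # Sum of log-likelihood ratios, each clamped to [-eps_prime, eps],
    # plus Lap(2) noise (density proportional to exp(-|z|/2)).
    cllr = np.clip(np.log(P[sample] / Q[sample]), -eps_prime, eps)
    return cllr.sum() + rng.laplace(scale=2)

# Decide "P" iff the noisy statistic exceeds the threshold kappa = 0.
n, trials = 500, 200
decisions_P = [ncLLR(rng.choice(3, size=n, p=P)) > 0 for _ in range(trials)]
decisions_Q = [ncLLR(rng.choice(3, size=n, p=Q)) > 0 for _ in range(trials)]
print(np.mean(decisions_P), np.mean(decisions_Q))
```

With n large enough, the first empirical rate is close to 1 and the second close to 0: the clamped sum has mean roughly ±0.09n under the two hypotheses here, which dwarfs both its standard deviation and the Laplace noise.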

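The expectation-gap computation at the heart of the proof of Theorem 2.5 can be illustrated concretely. A sketch for small discrete P and Q (all numbers are arbitrary illustrative choices): the per-sample gap Σ_x (P(x) − Q(x)) · cLLR(x) splits exactly into an ε-term on the region where P > e^ε Q, an unclamped middle term, and an ε′-term on the region where Q > e^{ε′} P:

```python
import numpy as np

# Illustrative discrete distributions and clamp levels (hypothetical).
P = np.array([0.55, 0.25, 0.15, 0.05])
Q = np.array([0.05, 0.25, 0.25, 0.45])
eps, eps_prime = 0.5, 0.5

llr = np.log(P / Q)
cllr = np.clip(llr, -eps_prime, eps)   # clamped log-likelihood ratio

S = P > np.exp(eps) * Q                # clamped above: cLLR = eps
T = Q > np.exp(eps_prime) * P          # clamped below: cLLR = -eps_prime
A = ~(S | T)                           # middle region: no clamping

gap = np.sum((P - Q) * cllr)           # per-sample expectation gap
gap_decomposed = ((P[S] - Q[S]).sum() * eps
                  + np.sum((P[A] - Q[A]) * llr[A])
                  + (Q[T] - P[T]).sum() * eps_prime)
print(gap, gap_decomposed)             # identical by construction
```

The decomposition is exact, mirroring how the proof separates the contribution of the disjoint-support piece (the ε and ε′ terms) from the KL/Hellinger piece on the overlap.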

The definitions of P̃ and Q̃ lend themselves naturally to a partition of the space X, depending on which quantities achieve the minimum in min(e^ε Q, P) and min(e^{ε′} P, Q). This partition will itself play a crucial role in both the proof of the theorem, and later in our lower bound: accordingly, define

    S = {x : P(x) − e^ε Q(x) > 0},
    T = {x : Q(x) − e^{ε′} P(x) > 0},        (7)

and set A = X \ (S ∪ T).

Proof of Theorem 2.5. Observe that for all x ∈ A, P̃(x) = P(x) and Q̃(x) = Q(x) (so that P′(x)/Q′(x) = P(x)/Q(x)), that for all x ∈ S, log(P′(x)/Q′(x)) = ε, and that for all x ∈ T, log(P′(x)/Q′(x)) = −ε′. To show that the test works, we first show that the clamped log-likelihood ratio (without noise) is a useful test statistic. In order to apply Lemma 2.6, we first calculate the difference Δ_gap in the expectations of cLLR_{−ε′,ε} under P and Q. For the remainder of the proof, we omit the −ε′, ε subscript (since clamping always occurs to the same interval).

    (1/n) Δ_gap = (1/n) E_{P^n}[cLLR(X)] − (1/n) E_{Q^n}[cLLR(X)]
                = ∫_X (P(x) − Q(x)) log(P′(x)/Q′(x)) dx
                = (P(S) − Q(S)) ε + ∫_A (P̃(x) − Q̃(x)) log(P′(x)/Q′(x)) dx + (Q(T) − P(T)) ε′
                = (P̃(S) − Q̃(S) + τ) ε + ∫_A (P̃(x) − Q̃(x)) log(P′(x)/Q′(x)) dx + (Q̃(T) − P̃(T) + τ) ε′,

where the last equality follows by the definition of τ. Moreover,

    n KL(P′∥Q′) + n KL(Q′∥P′)
        = n ∫_X (P′(x) − Q′(x)) log(P′(x)/Q′(x)) dx
        = n (P′(S) − Q′(S)) ε + n ∫_A (P′(x) − Q′(x)) log(P′(x)/Q′(x)) dx + n (Q′(T) − P′(T)) ε′
        = (n/(1−τ)) [ (P̃(S) − Q̃(S)) ε + ∫_A (P̃(x) − Q̃(x)) log(P′(x)/Q′(x)) dx + (Q̃(T) − P̃(T)) ε′ ].

Therefore, Δ_gap = (1−τ) n (KL(P′∥Q′) + KL(Q′∥P′)) + nτε + nτε′ ≥ n (2(1−τ) H^2(P′,Q′) + τε), where the last inequality follows since H^2(P,Q) ≤ KL(P∥Q) for any distributions P and Q.

We now turn to bounding the variance of the noisy clamped LLR under P and Q. Noting that Var_{P^n}[ncLLR] = Var_{P^n}[cLLR] + 8, by Lemma 2.6 it suffices to show that max{Var_{Q^n}[cLLR] + 8, Var_{P^n}[cLLR] + 8} ≤ O(Δ_gap^2); or, equivalently, that max{Var_{Q^n}[cLLR], Var_{P^n}[cLLR]} ≤ O(Δ_gap^2) and Δ_gap = Ω(1). Recall that P″ is a distribution such that P = τP″ + (1−τ)P′ and the support of P″ is contained in S. Thus,

    Var_{P^n}[cLLR] ≤ n ∫_X P(x) log^2(P′(x)/Q′(x)) dx
                    = n ∫_X (τ P″(x) + (1−τ) P′(x)) log^2(P′(x)/Q′(x)) dx
                    = n (τ P″(S) ε^2 + (1−τ) ∫_X P′(x) log^2(P′(x)/Q′(x)) dx).

Since log(P′(x)/Q′(x)) ≤ ε ≤ 1, we have |log(P′(x)/Q′(x))| ≤ 3 · |1 − √(Q′(x)/P′(x))|. Therefore,

    ∫_X P′(x) log^2(P′(x)/Q′(x)) dx ≤ 9 ∫_X P′(x) (1 − √(Q′(x)/P′(x)))^2 dx = 18 · H^2(P′,Q′).

Also, P″(S) = 1 and thus we have that Var_{P^n}[cLLR] = O(nτε^2 + (1−τ) n H^2(P′,Q′)) = O(nτε + (1−τ) n H^2(P′,Q′)). Similarly, Var_{Q^n}[cLLR] ≤ O(nτε + (1−τ) n H^2(P′,Q′)). Finally,

    max{Var_{Q^n}[cLLR], Var_{P^n}[cLLR]} = O(nτε + n(1−τ) H^2(P′,Q′)) = O(Δ_gap^2 / (n (τε + (1−τ) H^2(P′,Q′)))).

Therefore, if n ≥ C/(τε + (1−τ) H^2(P′,Q′)) for some suitably large constant C > 0, this implies that max{Var_{Q^n}[cLLR], Var_{P^n}[cLLR]} ≤ O(Δ_gap^2) and Δ_gap = Ω(1), which concludes our proof. □

3 LOWER BOUND ON THE SAMPLE COMPLEXITY OF PRIVATE TESTING

We now prove the lower bound in Theorem 1.2. We do so by constructing an appropriate coupling between the distributions P^n and Q^n, which implies lower bounds for privately distinguishing P^n from Q^n. This style of analysis was introduced in [3], though we require a strengthening of their statement. Specifically, the lower bound of Acharya et al. involves d_ε(X,Y) = ε d_H(X,Y), whereas we have d_ε(X,Y) = min(ε d_H(X,Y), 1).

For X, Y ∈ X^n, let d_H(X,Y) be the Hamming distance between X, Y (i.e., |{i : x_i ≠ y_i}|). Given a metric d : X^n × X^n → R_{≥0}, we define the Wasserstein distance W_d(P,Q) to be W_d(P,Q) = inf_ρ E_{(X,Y)∼ρ}[d(X,Y)], where the inf is over all couplings ρ of P^n and Q^n. Let d_ε(X,Y) = min{ε d_H(X,Y), 1}.

Lemma 3.1. For every ε-DP algorithm M : X^n → {0,1}, if X and Y are neighboring datasets then E[M(X)] ≤ e^ε E[M(Y)], where the expectations are over the randomness of the algorithm M.

Lemma 3.2. For every ε-DP algorithm M : X^n → {P,Q}, we have the following bound on the advantage: Pr_{X∼P}[M(X) = P] − Pr_{X∼Q}[M(X) = P] ≤ O(W_{d_ε}(P,Q)).

Proof. Let ρ : X^n × X^n → R_{≥0} be a coupling of P^n and Q^n and M be an ε-DP algorithm. We have

    Pr_{X∼P}[M(X) = P] − Pr_{X∼Q}[M(X) = P]
        = ∫_{X^n} ∫_{X^n} (ρ(X,Y) Pr_M[M(X) = P] − ρ(X,Y) Pr_M[M(Y) = P]) dX dY
        ≤ ∫_{X^n} ∫_{X^n} ρ(X,Y) min{1, (e^{ε d_H(X,Y)} − 1) Pr_M[M(Y) = P]} dX dY
        ≤ 2 ∫_{X^n} ∫_{X^n} ρ(X,Y) min{1, ε d_H(X,Y)} dX dY
        = O(E_{(X,Y)∼ρ}[d_ε(X,Y)]),

where Pr_M[·] denotes that the probability is over the randomness of the algorithm M, and the first inequality follows from Lemma 3.1. □

The upper bound in Lemma 3.2 is in fact tight. We state the converse (whose proof appears in the full version) below for completeness, although we will not use it in this work.

Lemma 3.3. There is an ε-DP algorithm M : X^n → {P,Q} such that Pr_{X∼P}[M(X) = P] − Pr_{X∼Q}[M(X) = P] ≥ Ω(W_{d_ε}(P,Q)).

We will also rely on the following standard fact characterizing total variation distance in terms of couplings:

Fact 1. Given P and Q, there exists a coupling ρ of P and Q such that Pr_ρ[X ≠ Y] = TV(P,Q). We will refer to this coupling as the total variation coupling of P and Q.

We can now prove the lower bound component of Theorem 1.2. Recall that P′ and Q′ were defined in Equation (5).

Theorem 3.4. Given P and Q, every ε-DP test K that distinguishes P and Q has the property that SC^{P,Q}(K) = Ω(1/(ετ + H^2(P′,Q′))).

Proof. Consider the following coupling of P^n and Q^n: Given a sample X ∼ P^n, independently for all x_i ∈ S⁵ label the point 0 with probability e^ε Q(x_i)/P(x_i), otherwise label it 1. Label all the points in A ∪ T as 0. (In particular, this implies that each x_i is labeled 0 with probability 1−τ, and 1 with probability τ.) Each point labeled 1 is then independently re-sampled from T according to the probability distribution (Q − e^{ε′} P)/τ restricted to T. Let Λ ⊆ [n] be the set of points labeled 0, and n′ be its size; and note that X_Λ is distributed according to (P′)^{n′}. We transform this set to a set distributed by (Q′)^{n′} using the TV-coupling of (P′)^{n′} and (Q′)^{n′}. The result is a sample from Q^n.

Now, we can rewrite

    d_H(X,Y) = Σ_{i∉Λ} 1{X_i ≠ Y_i} + Σ_{i∈Λ} 1{X_i ≠ Y_i} = d_H(X_Λ̄, Y_Λ̄) + 1{X_Λ ≠ Y_Λ} · Σ_{i∈Λ} 1{X_i ≠ Y_i},

and therefore

    E[min{ε d_H(X,Y), 1}]
        = E[min{ε d_H(X,Y), 1} 1{X_Λ = Y_Λ}] + E[min{ε d_H(X,Y), 1} 1{X_Λ ≠ Y_Λ}]
        ≤ ε E[d_H(X,Y) 1{X_Λ = Y_Λ}] + E[1{X_Λ ≠ Y_Λ}]
        = ε E[d_H(X_Λ̄, Y_Λ̄)] + Pr[X_Λ ≠ Y_Λ].

Recalling now that the distribution of (X_Λ, Y_Λ) is that of the TV-coupling of (P′)^{n′} and (Q′)^{n′}, and that |Λ̄| = n − n′, we get

    E[min{ε d_H(X,Y), 1}] ≤ E[ε(n − n′) + TV((P′)^{n′}, (Q′)^{n′})] ≤ ετn + E[√(n′) H(P′,Q′)] ≤ ετn + √((1−τ)n) · H(P′,Q′).

Therefore, by Lemma 3.2, we have that for every ε-DP test M,

    Pr_{X∼P}[M(X) = P] − Pr_{X∼Q}[M(X) = P] ≤ ετn + √((1−τ)n) · H(P′,Q′).

Thus, in order for the probability of success to be Ω(1), we need either ετn or √((1−τ)n) H(P′,Q′) to be Ω(1). That is, n ≥ Ω(min{1/(ετ), 1/((1−τ) H^2(P′,Q′))}). □

⁵Recall the definitions of S, T and A from Section 2.3 (Equation (7)).

4 APPLICATION: DIFFERENTIALLY PRIVATE CHANGE-POINT DETECTION

In this section, we give an application of our method to differentially private change-point detection (CPD). In the change-point detection problem, we are given a time-series of data. Initially, it comes from a known distribution P, and at some unknown time step, it begins to come from another known distribution Q. The goal is to approximate when this change occurs. More formally, we have the following definition.

Definition 4.1. In the offline change-point detection problem, we are given distributions P, Q and a data set X = {x_1, …, x_n}. We are guaranteed that there exists k* ∈ [n] such that x_1, …, x_{k*−1} ∼ P and x_{k*}, …, x_n ∼ Q. The goal is to output k̂ such that |k̂ − k*| is small.

In the online change-point detection problem, we are given distributions P, Q and a stream of data points X = {x_1, …}. We are guaranteed that there exists k* such that x_1, …, x_{k*−1} ∼ P and x_{k*}, … ∼ Q. The goal is to output k̂ such that |k̂ − k*| is small.

We study the parameterization of the private change-point detection problem recently introduced by Cummings et al. [22].

Definition 4.2 (Change-Point Detection). An algorithm for an (online) change-point detection problem is (α, β)-accurate if for any input dataset (data stream), with probability at least 1 − β it outputs a k̂ such that |k̂ − k*| ≤ α, where the probability is with respect to the randomness in the sampling of the data set and the random choices made by the algorithm.

Our main result is the following:

Theorem 4.3. There exists an efficient ε-differentially private and (α, β)-accurate algorithm for offline change-point detection from distribution P to Q with

    α = Θ((1/(ετ(P,Q) + H^2(P′,Q′))) · log(1/β)).

Furthermore, there exists an efficient ε-differentially private and (α, β)-accurate algorithm for online change-point detection from distribution P to Q with the same value of α. This latter algorithm also requires as input a value n such that n = Ω(SC_ε^{P,Q} · log(k*/(nβ))). If the algorithm is accurate, it will observe at most k* + 2n data points, and with high probability observe k* + O(n log n) data points.

For constant β, the accuracy of our offline algorithm is optimal up to constant factors, since one can easily show that the best accuracy achievable is Ω(SC_ε^{P,Q}). A similar statement holds for our online algorithm when the algorithm is given an estimate n of k* such that n = poly(k*).

As one might guess, this problem is intimately related to the hypothesis testing question studied in the rest of this paper. Indeed, our change-point detection algorithm will use the hypothesis testing algorithm of Theorem 1.2 as a black box, in order to reduce to a simpler Bernoulli change-point detection problem (see Lemma 4.4 in Section 4.1). We then give an algorithm to solve this simpler problem (Lemma 4.5 in Section 4.2), completing the proof of Theorem 4.3. In Section 4.3, we show that our reduction is applicable more generally, as we describe an algorithm for change-point detection in a goodness-of-fit setting.
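The total variation coupling of Fact 1, which the proof of Theorem 3.4 applies to (P′)^{n′} and (Q′)^{n′}, is constructive for discrete distributions: keep the shared mass min(P, Q) on the diagonal, and pair the leftover masses (which have disjoint supports) off the diagonal. A sketch, with illustrative distributions:

```python
import numpy as np

rng = np.random.default_rng(2)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

tv = 0.5 * np.abs(P - Q).sum()
common = np.minimum(P, Q)        # total mass 1 - TV, placed on X = Y
excess_P = (P - common) / tv     # leftover P-mass (where P > Q)
excess_Q = (Q - common) / tv     # leftover Q-mass (where Q > P)

def tv_coupling():
    # With probability 1 - TV draw X = Y from the common part; otherwise
    # X and Y come from the disjoint leftovers, so X != Y surely.
    if rng.random() < 1 - tv:
        x = rng.choice(3, p=common / (1 - tv))
        return x, x
    return rng.choice(3, p=excess_P), rng.choice(3, p=excess_Q)

N = 100_000
disagree = np.mean([x != y for (x, y) in (tv_coupling() for _ in range(N))])
print(disagree, tv)  # Pr[X != Y] matches TV(P, Q)
```

Marginally, X ∼ P and Y ∼ Q, while Pr[X ≠ Y] equals TV(P, Q) exactly, which is the property the lower-bound argument needs.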

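The non-private Bernoulli primitive that Sections 4.1 and 4.2 combine can be sketched directly: given ±1 bits whose bias flips from τ_0 > 2/3 to τ_1 < 1/3 at an unknown index, output the minimizer of the suffix sums ℓ(t) = Σ_{j=t}^n z_j. All concrete parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic restricted instance: bias flips from tau0 to tau1 at k_star.
n, k_star = 1000, 600
tau0, tau1 = 0.8, 0.2
z = np.concatenate([
    np.where(rng.random(k_star - 1) < tau0, 1, -1),      # z_1 .. z_{k*-1}
    np.where(rng.random(n - k_star + 1) < tau1, 1, -1),  # z_{k*} .. z_n
])

# ell(t) = sum_{j=t}^n z_j; the estimate is argmin_t ell(t) (1-indexed).
ell = np.cumsum(z[::-1])[::-1]
k_hat = int(np.argmin(ell)) + 1
print(k_hat, k_star)  # typically within O(log(1/beta)) of each other
```

In the private pipeline of Lemma 4.4, each bit z_j is produced by running the ε-DP test of Theorem 1.2 on a disjoint block of SC_ε^{P,Q} raw samples, so the bit sequence, and hence k̂, is already differentially private.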

4.1 A Reduction to Bernoulli CPD

In this section, we provide a reduction from private change-point detection with arbitrary distributions to non-private change-point detection with Bernoulli distributions.

Lemma 4.4. Suppose there exists an (α, β)-accurate algorithm which can solve the following restricted change-point detection problem: we are guaranteed that there exists k̃* such that z_1, …, z_{k̃*−1} ∼ 2 Ber(τ_0) − 1 for some τ_0 > 2/3, z_{k̃*+1}, … ∼ 2 Ber(τ_1) − 1 for some τ_1 < 1/3, and z_{k̃*} ∈ {±1} is arbitrary. Then there exists an ε-differentially private and ((α + 1) · SC_ε^{P,Q}, β)-accurate algorithm which solves the change-point detection problem, where SC_ε^{P,Q} is as defined in Theorem 1.2.

Proof. We describe the reduction for the offline version of the problem; the reduction in the online setting is identical. The reduction is easy to describe. We partition the samples into intervals of length SC_ε^{P,Q}. Let Y_j = {x_{(j−1) SC_ε^{P,Q} + 1}, …, x_{j SC_ε^{P,Q}}}, for j = 1 to ⌊n/SC_ε^{P,Q}⌋, and disregard the remaining "tail" of x_i's. We run the algorithm of Theorem 1.2 on each Y_j, and produce a bit z_j = 1 if the algorithm outputs that the distribution is P, and z_j = −1 otherwise.

We show that this forms a valid instance of the change-point detection problem in the lemma statement. Let k* be the change-point in the original problem, and suppose it belongs to Y_{k̃*}. For every j < k̃*, all the samples are from P, and by Theorem 1.2, each z_j will independently be 1 with probability τ_0 ≥ 2/3. Similarly, for every j > k̃*, all the samples are from Q, and each z_j will independently be −1 with probability at least 1 − τ_1 ≥ 2/3.

Finally, we show that the existence of an (α, β)-accurate algorithm that solves this problem also solves the original problem. Suppose that the output of the algorithm on the restricted change-point detection problem is j. To map this to an answer to the original problem, we let k̂ = (j − 1) SC_ε^{P,Q}.

First, note that k̂ will be ε-differentially private. We claim that the sequence of z_j's is ε-differentially private. This is because the algorithm of Theorem 1.2 is ε-differentially private, we apply the algorithm independently to each component of the partition, and each data point affects only one component (since they are disjoint). Privacy of k̂ follows since privacy is closed under post-processing.

Finally, we establish accuracy. In the restricted change-point detection problem, with probability at least 1 − β, the output j will be such that |j − k̃*| ≤ α. In the original problem's domain, this corresponds to a k̂ such that |k̂ − k*| ≤ (α + 1) SC_ε^{P,Q}, as desired. □

4.2 Solving Bernoulli CPD

In this section, we show that there is a (Θ(log(1/β)) + 1, β)-accurate algorithm for the restricted change-point detection problem. Combined with Lemma 4.4, this implies Theorem 4.3.

Lemma 4.5. There exists an efficient (O(log(1/β)), β)-accurate algorithm for the offline restricted change-point detection problem (as defined in Lemma 4.4).

Similarly, there exists an efficient (O(log(1/β)), β)-accurate algorithm for the online restricted change-point detection problem. This algorithm requires as input a value n such that n = Ω(log(k*/(nβ))). If the algorithm is accurate, it will observe at most k* + 2n data points, and with high probability observe k* + O(n log n) data points.

Proof. We start by describing the algorithm for the offline version of the problem. We then discuss how to reduce from the online problem to the offline problem.

Offline Change-Point Detection. We define the function ℓ(t) = Σ_{j=t}^n z_j. The algorithm's output will be k̂ = argmin_{1≤t≤n} ℓ(t). Let k* be the true change-point index. To prove correctness of this algorithm, we show that ℓ(k* + 1) − ℓ(t) < 0 for all t ≥ k* + 1 + Θ(log(1/β)), and that ℓ(k* − 1) − ℓ(t) < 0 for all t ≤ k* − 1 − Θ(log(1/β)). Together, these will show that argmin_{1≤t≤n} ℓ(t) ∈ [k* − 1 − Θ(log(1/β)), k* + 1 + Θ(log(1/β))], proving the result. For the remainder of this proof, we focus on the former case; the latter will follow symmetrically. Specifically, we will show that ℓ(k* + 1) − ℓ(t) < 0 for all t ≥ k* + 1 + c log(1/β), where c > 0 is some large absolute constant.

Observe that for t ≥ k* + 1, ℓ(k* + 1) − ℓ(t) = Σ_{j=k*+1}^{t−1} z_j forms a biased random walk which takes a +1-step with probability τ_1 ≤ 1/3 and a −1-step with probability 1 − τ_1 ≥ 2/3. Define M_i = ℓ(k* + 1) − ℓ(k* + 1 + i) + i(1 − 2τ_1) for i = 0 to n − k* − 1, and note that this forms a martingale sequence. We will use Theorem 4 of [7], which provides a finite-time law of the iterated logarithm result. Specialized to our setting, we obtain the following maximal inequality, bounding how far this martingale deviates away from 0.

Theorem 4.6 (Follows from Theorem 4 of [7]). Let c > 0 be some absolute constant. With probability at least 1 − β, for all i ≥ c log(1/β) simultaneously, |M_i| ≤ O(√(i (log log(i) + log(1/β)))).

This implies that, with probability at least 1 − β, we have that for all i ≥ c log(1/β),

    ℓ(k* + 1) − ℓ(k* + 1 + i) = M_i − i(1 − 2τ_1) ≤ O(√(i (log log(i) + log(1/β)))) − i/3.

Note that the right-hand side is non-increasing in i, so it is maximized at i = c log(1/β), and thus

    ℓ(k* + 1) − ℓ(k* + 1 + i) ≤ O(√(log(1/β) (log log log(1/β) + log(1/β)))) − (c/3) log(1/β) < 0,

where the last inequality follows for a sufficiently large choice of c.

Online Change-Point Detection. The algorithm will be as follows. Partition the stream into consecutive intervals of length n, which we will draw in batches. If an interval has more −1's than +1's, then call the offline change-point detection algorithm on the final 2n data points with failure probability parameter set to β/4, and output whatever it says.

Let k* be the true change-point index. First, we show that with probability ≥ 1 − β/4, the algorithm will not see more −1's than +1's in any interval before the one containing k*. The number of +1's in such an interval will be distributed as Binomial(n, τ_0) for τ_0 > 2/3. By a Chernoff bound, the probability that we have > n/2 −1's is at most exp(−Θ(n)). Taking a union bound over all O(k*/n) intervals before the change point gives a failure probability of (k*/n) exp(−Θ(n)) ≤ β/4, where the last inequality follows by our condition on n.

Next, note that the interval following the one containing k* will have a number of +1's which is distributed as Binomial(n, τ_1) for τ_1 < 1/3. By a similar Chernoff bound, the probability that we have > n/2 +1's is at most exp(−Θ(n)) ≪ β/4. Thus, with probability


1 − β/2, the algorithm will call the offline change-point detection algorithm on an interval containing the true change point k*. We conclude by the correctness guarantees of the offline change-point detection algorithm. Note that we chose the failure probability parameter to be β/4, as the offline algorithm may either be called at the interval containing k*, or the following one, and we take a union bound over both of them. □

4.3 Private Goodness-of-Fit CPD

Our reduction above is rather general, and it can apply to more general change-point detection settings. For instance, the above discussion assumes we know both the initial and final distributions P and Q. Instead, one could imagine a setting where one knows the initial distribution P but not the final distribution Q, which we term goodness-of-fit change-point detection.

Definition 4.7. In the offline γ-goodness-of-fit change-point detection problem, we are given a distribution P over domain X and a data set X = {x_1, …, x_n}. We are guaranteed that there exists k* ∈ [n] such that x_1, …, x_{k*−1} ∼ P, and x_{k*}, …, x_n ∼ Q, for some fixed (but unknown) distribution Q over domain X, such that TV(P,Q) ≥ γ. The goal is to output k̂ such that |k̂ − k*| is small.

We note that analogous definitions and results hold for the online version of this problem, as in the previous sections.

We omit the full details of the proof, but it proceeds by a very similar argument to that in Sections 4.1 and 4.2. In particular, it is possible to prove an analogue of Lemma 4.4, at which point we can apply Lemma 4.5. The only difference is that we need an algorithm for private goodness-of-fit testing, rather than Theorem 1.2 for hypothesis testing. We use the following result from [3].

Theorem 4.8 (Theorem 13 of [3]). Let P be a known distribution over X, and let Q be the set of all distributions Q over X such that TV(P,Q) ≥ γ. Given n samples from an unknown distribution which is either P, or some Q ∈ Q, there exists an efficient ε-differentially private algorithm which distinguishes between the two cases with probability ≥ 2/3 when

    n = Θ(|X|^{1/2}/γ^2 + |X|^{1/2}/(γ ε^{1/2}) + |X|^{1/3}/(γ^{4/3} ε^{2/3}) + 1/(γε)).

With this in hand, we have the following result for goodness-of-fit change-point detection.

Theorem 4.9. There exists an efficient ε-differentially private and (α, β)-accurate algorithm for offline γ-goodness-of-fit change-point detection with

    α = Θ(|X|^{1/2}/γ^2 + |X|^{1/2}/(γ ε^{1/2}) + |X|^{1/3}/(γ^{4/3} ε^{2/3}) + 1/(γε)) · log(1/β).

ACKNOWLEDGMENTS

CC was supported by a Motwani Postdoctoral Fellowship. GK was supported as a Research Fellow, as part of the Simons-Berkeley Research Fellowship program. AM was supported by NSF award AF-1763786, a Sloan Foundation Research Award, and a postdoctoral fellowship from BU's Hariri Institute for Computing. AS was supported by NSF awards IIS-1447700 and AF-1763786 and a Sloan Foundation Research Award. JU was supported by NSF grants CCF-1718088, CCF-1750640, and CNS-1816028, and a Google Faculty Research Award. We are grateful to Salil Vadhan for valuable discussions in the early stages of this work.

REFERENCES

[1] Jayadev Acharya, Clément L. Canonne, Cody Freitag, and Himanshu Tyagi. 2019. Test without Trust: Optimal Locally Private Distribution Testing. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019 (to appear). Full version available at arXiv:1808.02174. (2019).
[2] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. 2018. INSPECTRE: Privately Estimating the Unseen. In Proceedings of the 35th International Conference on Machine Learning (ICML '18). JMLR, Inc., 30–39.
[3] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. 2018. Differentially Private Testing of Identity and Closeness of Discrete Distributions. In Advances in Neural Information Processing Systems 31 (NeurIPS '18). Curran Associates, Inc.
[4] Maryam Aliakbarpour, Ilias Diakonikolas, and Ronitt Rubinfeld. 2018. Differentially Private Identity and Closeness Testing of Discrete Distributions. In Proceedings of the 35th International Conference on Machine Learning (ICML '18). JMLR, Inc., 169–178.
[5] Jordan Awan and Aleksandra Slavkovic. 2018. Differentially Private Uniformly Most Powerful Tests for Binomial Data. In Advances in Neural Information Processing Systems 31 (NeurIPS '18). Curran Associates, Inc., 4212–4222.
[6] Borja Balle, Gilles Barthe, and Marco Gaboardi. 2018. Privacy Amplification by Subsampling: Tight Analyses via Couplings and Divergences. In Advances in Neural Information Processing Systems 31 (NeurIPS '18). Curran Associates, Inc., 6280–6290.
[7] Akshay Balsubramani. 2015. Sharp Finite-Time Iterated-Logarithm Martingale Concentration. arXiv preprint arXiv:1405.2639 (2015).
[8] Ziv Bar-Yossef. 2002. The Complexity of Massive Data Set Computations. Ph.D. Dissertation. UC Berkeley. Adviser: Christos Papadimitriou. Available at http://webee.technion.ac.il/people/zivby/index_files/Page1489.html.
[9] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. 2016. Algorithmic Stability for Adaptive Data Analysis. In Proceedings of the 48th Annual ACM Symposium on the Theory of Computing (STOC '16). ACM, Cambridge, MA, 1046–1059.
[10] Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2013. Characterizing the Sample Complexity of Private Learners. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science (ITCS '13). ACM, 97–110.
[11] Aditya Bhaskara, Daniel Dadush, Ravishankar Krishnaswamy, and Kunal Talwar. 2012. Unconditional Differentially Private Mechanisms for Linear Queries. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC '12). ACM, 1269–1284.
[12] Aleksandr Alekseevich Borovkov. 1999. Mathematical Statistics. CRC Press.
[13] Hai Brenner and Kobbi Nissim. 2014. Impossibility of Differentially Private Universally Optimal Mechanisms. SIAM J. Comput. 43, 5 (2014), 1513–1540.
[14] Mark Bun and Thomas Steinke. 2016. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. In Proceedings of the 14th Conference on Theory of Cryptography (TCC '16-B). Springer, Berlin, Heidelberg, 635–658.
[15] Mark Bun, Jonathan Ullman, and Salil Vadhan. 2014. Fingerprinting Codes and the Price of Approximate Differential Privacy. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing (STOC '14). ACM, New York, NY, USA, 1–10.
[16] Bryan Cai, Constantinos Daskalakis, and Gautam Kamath. 2017. Priv'IT: Private and Sample Efficient Identity Testing. In Proceedings of the 34th International Conference on Machine Learning (ICML '17). JMLR, Inc., 635–644.
[17] Zachary Campbell, Andrew Bray, Anna Ritz, and Adam Groce. 2018. Differentially Private ANOVA Testing. In Proceedings of the 2018 International Conference on Data Intelligence and Security (ICDIS '18). IEEE Computer Society, Washington, DC, USA, 281–285.
[18] Clément L. Canonne. 2017. A Short Note on Distinguishing Discrete Distributions. https://github.com/ccanonne/probabilitydistributiontoolbox/blob/master/testing.pdf. (2017).
[19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam Smith, and Jonathan Ullman. 2018. The Structure of Optimal Private Tests for Simple Hypotheses. arXiv preprint arXiv:1811.08382 (2018).
[20] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. 2011. Differentially Private Empirical Risk Minimization. Journal of Machine Learning Research 12 (2011), 1069–1109.
[21] Simon Couch, Zeki Kazan, Kaiyan Shi, Andrew Bray, and Adam Groce. 2019. Differentially Private Nonparametric Hypothesis Testing. arXiv preprint arXiv:1903.09364 (2019).
[22] Rachel Cummings, Sara Krehbiel, Yajun Mei, Rui Tuo, and Wanrong Zhang. 2018. Differentially Private Change-Point Detection. In Advances in Neural Information Processing Systems 31 (NeurIPS '18). Curran Associates, Inc., 10848–10857.
[23] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. 2016. Adaptive Learning with Robust Generalization Guarantees. In Conference on Learning Theory (COLT). 772–814.
[24] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. 2017. The Modernization of Statistical Disclosure
Limitation at the U.S. Census Bureau. (2017). Presented at the September 2017 meeting of the Census Scientific Advisory Committee.
[25] Differential Privacy Team, Apple. 2017. Learning with Privacy at Scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf. (December 2017).
[26] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Local Privacy and Statistical Minimax Rates. In IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS '13). 429–438.
[27] John C. Duchi and Feng Ruan. 2018. The Right Complexity Measure in Locally Private Estimation: It is not the Fisher Information. arXiv preprint arXiv:1806.05756 (2018).
[28] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. Generalization in Adaptive Data Analysis and Holdout Reuse. In Advances in Neural Information Processing Systems (NIPS).
[29] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the 47th Annual ACM Symposium on the Theory of Computing (STOC '15). ACM, 1046–1059.
[30] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. The reusable holdout: Preserving validity in adaptive data analysis. Science 349, 6248 (June 2015), 636–638.
[31] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. 2006. Our Data, Ourselves: Privacy Via Distributed Noise Generation. In Advances in Cryptology - EUROCRYPT 2006, 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006, Proceedings. 486–503.
[32] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In 3rd IACR Theory of Cryptography Conference (TCC '06). Springer, New York, NY, 265–284.
[33] Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9, 3-4 (2014), 211–407.
[34] Cynthia Dwork and Guy N. Rothblum. 2016. Concentrated Differential Privacy. arXiv preprint arXiv:1603.01887 (2016).
[35] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. 2015. Robust Traceability from Trace Amounts. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS '15). IEEE Computer Society, Berkeley, CA, 650–669.
[36] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In ACM SIGSAC Conference on Computer and Communications Security, CCS.
[37] Vitaly Feldman and Thomas Steinke. 2017. Generalization for Adaptively-chosen Estimators via Stable Median. In COLT 2017 - The 30th Annual Conference on Learning Theory.
[38] Vitaly Feldman and Thomas Steinke. 2018. Calibrating Noise to Variance in Adaptive Data Analysis. In Proceedings of the 31st Annual Conference on Learning Theory (COLT '18). 535–544.
[39] Vitaly Feldman and David Xiao. 2014. Sample complexity bounds on differentially private learning via communication complexity. In Proceedings of the 27th Annual Conference on Learning Theory (COLT '14). 1000–1019.
[40] Marco Gaboardi, Hyun-Woo Lim, Ryan M. Rogers, and Salil P. Vadhan. 2016. Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing. In ICML. 2111–2120.
[41] Marco Gaboardi and Ryan Rogers. 2018. Local Private Hypothesis Testing: Chi-Square Tests. In Proceedings of the 35th International Conference on Machine Learning (ICML '18). JMLR, Inc., 1626–1635.
[42] Alison L. Gibbs and Francis E. Su. 2002. On choosing and bounding probability metrics. International Statistical Review 70, 3 (December 2002), 419–435.
[43] Moritz Hardt and Kunal Talwar. 2010. On the geometry of differential privacy. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC.
[44] Kazuya Kakizaki, Jun Sakuma, and Kazuto Fukuchi. 2017. Differentially Private Chi-squared Test by Unit Circle Mechanism. In Proceedings of the 34th International Conference on Machine Learning (ICML '17). JMLR, Inc., 1761–1770.
[45] Vishesh Karwa and Salil Vadhan. 2018. Finite Sample Differentially Private Confidence Intervals. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science (ITCS '18). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 44:1–44:9.
[46] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2008. What Can We Learn Privately?. In FOCS. IEEE, 531–540.
[47] Shiva Prasad Kasiviswanathan and Adam D. Smith. 2008. A Note on Differential Privacy: Defining Resistance to Arbitrary Side Information. CoRR abs/0803.3946 (2008).
[48] Assimakis Kattis and Aleksandar Nikolov. 2016. Lower Bounds for Differential Privacy from Gaussian Width. In 33rd International Symposium on Computational Geometry. arXiv preprint arXiv:1612.02914, 45:1–45:16.
[49] Daniel Kifer and Ryan M. Rogers. 2017. A New Class of Private Chi-Square Tests. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS '17). JMLR, Inc., 991–1000.
[50] Martin Kulldorff. 2001. Prospective Time Periodic Geographical Disease Surveillance Using a Scan Statistic. Journal of the Royal Statistical Society: Series A (Statistics in Society) 164, 1 (2001), 61–72.
[51] Tze Leung Lai. 1995. Sequential Changepoint Detection in Quality Control and Dynamical Systems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57, 4 (1995), 613–658.
[52] Jingbo Liu, Paul W. Cuff, and Sergio Verdú. 2015. Eγ-Resolvability. CoRR abs/1511.07829 (2015). arXiv:1511.07829 http://arxiv.org/abs/1511.07829
[53] Gary Lorden. 1971. Procedures for Reacting to a Change in Distribution. The Annals of Mathematical Statistics 42, 6 (1971), 1897–1908.
[54] Frank McSherry and Kunal Talwar. 2007. Mechanism Design via Differential Privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07). 94–103.
[55] Yajun Mei. 2006. Sequential Change-Point Detection when Unknown Parameters are Present in the Pre-Change Distribution. The Annals of Statistics 34, 1 (2006), 92–122.
[56] George V. Moustakides. 1986. Optimal Stopping Times for Detecting Changes in Distributions. The Annals of Statistics 14, 4 (1986), 1379–1387.
[57] Aleksandar Nikolov. 2015. An Improved Private Mechanism for Small Databases. In Automata, Languages, and Programming - 42nd International Colloquium, ICALP. 1010–1021.
[58] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. 2016. The Geometry of Differential Privacy: The Small Database and Approximate Cases. SIAM J. Comput. 45, 2 (2016), 575–616. https://doi.org/10.1137/130938943
[59] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th annual ACM Symposium on Theory of Computing, STOC. 75–84.
[60] Ewan S. Page. 1954. Continuous Inspection Schemes. Biometrika 41, 1/2 (1954), 100–115.
[61] Ewan S. Page. 1955. A Test for a Change in a Parameter Occurring at an Unknown Point. Biometrika 42, 3/4 (1955), 523–527.
[62] Moshe Pollak. 1985. Optimal Detection of a Change in Distribution. The Annals of Statistics 13, 1 (1985), 206–227.
[63] Moshe Pollak. 1987. Average Run Lengths of an Optimal Method of Detecting a Change in Distribution. The Annals of Statistics 15, 2 (1987), 749–779.
[64] Ryan Rogers, Aaron Roth, Adam Smith, and Om Thakkar. 2016. Max-information, differential privacy, and post-selection hypothesis testing. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS. 487–494.
[65] Daniel Russo and James Zou. 2016. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, AISTATS. 1232–1240.
[66] Or Sheffet. 2018. Locally Private Hypothesis Testing. In Proceedings of the 35th International Conference on Machine Learning (ICML '18). JMLR, Inc., 4605–4614.
[67] Walter Andrew Shewhart. 1931. Economic Control of Quality of Manufactured Product. ASQ Quality Press.
[68] Albert N. Shiryaev. 1963. On Optimum Methods in Quickest Detection Problems. Theory of Probability & Its Applications 8, 1 (1963), 22–46.
[69] Adam Smith. 2011. Privacy-Preserving Statistical Estimation with Optimal Convergence Rates. In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing (STOC '11). ACM, New York, NY, USA, 813–822.
[70] Social Science One. 2018. SocialScienceOne. https://socialscience.one/. (2018).
[71] Marika Swanberg, Ira Globus-Harris, Iris Griffith, Anna Ritz, Adam Groce, and Andrew Bray. 2019. Improved Differentially Private Analysis of Variance. Proceedings on Privacy Enhancing Technologies 2019, 3 (2019).
[72] Caroline Uhler, Aleksandra Slavković, and Stephen E. Fienberg. 2013. Privacy-preserving Data Sharing for Genome-wide Association Studies. The Journal of Privacy and Confidentiality 5, 1 (2013), 137–166.
[73] Venugopal V. Veeravalli and Taposh Banerjee. 2014. Quickest Change Detection. Academic Press Library in Signal Processing 3 (2014), 209–255.
[74] Duy Vu and Aleksandra Slavković. 2009. Differential Privacy for Clinical Trial Data: Preliminary Evaluations. In 2009 IEEE International Conference on Data Mining Workshops (ICDMW '09). IEEE, 138–143.
[75] Yue Wang, Daniel Kifer, Jaewoo Lee, and Vishesh Karwa. 2018. Statistical Approximating Distributions Under Differential Privacy. The Journal of Privacy and Confidentiality 8, 1 (2018), 1–33.
[76] Yue Wang, Jaewoo Lee, and Daniel Kifer. 2015. Revisiting Differentially Private Hypothesis Tests for Categorical Data. arXiv preprint arXiv:1511.03376 (2015).
[77] Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. 2016. A Minimax Theory for Adaptive Data Analysis. CoRR abs/1602.04287 (2016).
[78] Aolin Xu and Maxim Raginsky. 2017. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms. In Advances in Neural Information Processing Systems 30 (NIPS '17). Curran Associates, Inc., 2524–2533.
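As a concrete reading of the Θ(·) bound in Theorem 4.8, the sketch below (our own illustration, not code from the paper; the function name is ours, and all constants hidden by Θ(·) are set to 1, so it reports growth rates rather than exact sample counts) evaluates the four terms of the sample-complexity expression and identifies which one dominates for given |X|, γ, and ε:

```python
import math

def gof_sample_complexity_terms(domain_size, gamma, eps):
    """Evaluate the four terms of the Theorem 4.8 bound
    n = Theta(|X|^{1/2}/gamma^2 + |X|^{1/2}/(gamma eps^{1/2})
              + |X|^{1/3}/(gamma^{4/3} eps^{2/3}) + 1/(gamma eps)),
    with hidden constants set to 1 (growth rates only)."""
    k = domain_size
    return {
        # non-private statistical cost of goodness-of-fit testing
        "sqrt(k)/gamma^2": math.sqrt(k) / gamma ** 2,
        # privacy costs, shrinking as eps grows
        "sqrt(k)/(gamma sqrt(eps))": math.sqrt(k) / (gamma * math.sqrt(eps)),
        "k^(1/3)/(gamma^(4/3) eps^(2/3))": k ** (1 / 3) / (gamma ** (4 / 3) * eps ** (2 / 3)),
        # domain-independent privacy cost
        "1/(gamma eps)": 1 / (gamma * eps),
    }

# Example: large domain, moderate privacy -- the non-private term dominates.
terms = gof_sample_complexity_terms(domain_size=10**6, gamma=0.1, eps=1.0)
dominant = max(terms, key=terms.get)
```

Theorem 4.9's accuracy bound α is the same four-term expression multiplied by log(1/β); for small ε (strict privacy) the ε-dependent terms take over instead.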
