Privacy-preserving Statistical Estimation with Optimal Convergence Rates∗

[Extended Abstract]

Adam Smith†
Department of Computer Science and Engineering, Pennsylvania State University, University Park
[email protected]

ABSTRACT
Consider an analyst who wants to release aggregate statistics about a data set containing sensitive information. Using differentially private algorithms guarantees that the released statistics reveal very little about any particular record in the data set. In this paper we study the asymptotic properties of differentially private algorithms for statistical inference. We show that for a large class of statistical estimators T and input distributions P, there is a differentially private estimator A_T with the same asymptotic distribution as T. That is, the random variables A_T(X) and T(X) converge in distribution when X consists of an i.i.d. sample from P of increasing size. This implies that A_T(X) is essentially as good as the original statistic T(X) for statistical inference, for sufficiently large samples. Our technique applies to (almost) any pair T, P such that T is asymptotically normal on i.i.d. samples from P; in particular, to parametric maximum likelihood estimators and estimators for logistic and linear regression under standard regularity conditions.
A consequence of our techniques is the existence of low-space streaming algorithms whose output converges to the same asymptotic distribution as a given estimator T (for the same class of estimators and input distributions as above).

Categories and Subject Descriptors
F.2.0 [Theory of Computation]: Analysis of Algorithms and Problem Complexity—General; G.3 [Mathematics of Computing]: Probability and Statistics

General Terms
Algorithms, Theory

Keywords
Differential Privacy, Statistical Inference, Asymptotic Distribution

∗A weaker version of the results of this paper appeared in an unpublished manuscript by the same author [Smi08].
†Supported by NSF PECASE Award CCF-0747294 and NSF Award CCF-0729171.

1. INTRODUCTION
Private data analysis has recently emerged as a fascinating field at the interface of algorithms, statistical learning, complexity theory and geometry. One specific notion of privacy, differential privacy [DMNS06, Dwo06], has received significant attention, largely because it is conceptually simple and yet offers meaningful guarantees in the presence of arbitrary external information.
In this paper we consider the following basic problem: given a data set of sensitive information, what kinds of statistical inference and learning can we carry out without the results leaking sensitive information? For example, when can we guarantee that the published analysis of a clinical study's outcomes will not leak details of an individual patient's treatment?
We provide, in some sense, a "free lunch" for data analysts: we show that given a statistical estimator T that returns a relatively low-dimensional real vector, one can design a differentially private algorithm A_T with the following rough guarantee:

    If the input X_1, X_2, ..., X_n is drawn i.i.d. (according to a distribution that need not be known to A_T), and if T(X_1, ..., X_n), appropriately rescaled, converges to a normal distribution as n grows, then A_T(X_1, ..., X_n) converges to the same distribution.

Differential privacy is parameterized by a number ε > 0 (Definition 1). Our guarantee works even as ε tends to 0; we require only that 1/ε be bounded above by a small polynomial in the sample size n.
Our analysis requires additional conditions on the convergence of T; these are met by essentially all the "asymptotic normality" results from classical statistical theory. Estimators which fit our assumptions include the maximum likelihood estimator for "nice" parametric families of distributions; maximum likelihood estimators for a host of regression problems, including linear regression and logistic regression; estimators for the parameters of low-dimensional mixture models with a fixed number of components; and, in general, estimators for which T, viewed as a functional on the space of distributions, is Fréchet differentiable and has a non-zero derivative at the distribution P from which the X_i are drawn.
In several cases, notably for parametric estimation and regression, the estimators to which our result applies are known to have minimum possible variance among a large class of estimators. In those cases, we give differentially private estimators whose variance is within a 1 + o(1) factor of optimal. In that sense, we provide a free lunch for private statistical inference: optimal accuracy (the same as the best nonprivate estimators) and strong guarantees of privacy.
Our algorithms also provide one-pass, space-efficient algorithms that compute an asymptotically similar statistic to a given statistic T. These algorithms use space Õ(√n) when T itself can be computed in quasilinear space. To the best of our knowledge, previously studied estimators require linear space to achieve the same guarantee. For a discussion of previous work on streaming algorithms for i.i.d. and randomly-ordered data, see Chien, Ligett and McGregor [CLM10].

Previous Work.
The general question of the design of differentially private algorithms has received much recent attention, too much to survey here. We focus on work directly relevant to our concerns. Blum et al. [BDMN05] considered a simple mechanism which expresses a learning or estimation algorithm in terms of a sequence of "statistical queries" (in the sense of Kearns [Kea98]) to the data, and adds noise to the result of each query. In most cases, it is difficult to see how this reformulation and noise addition will affect the quality of the final answer. For the class of exponential families of probability distributions, the sufficient statistics for a family can be computed directly via statistical queries, and so the results of Blum et al. [BDMN05] (and follow-up work by Dwork et al. [DMNS06]) would apply. Unfortunately, many probability models, e.g. mixture models, do not have such simple sufficient statistics.
Kasiviswanathan et al. [KLN+08] gave general but inefficient algorithms with PAC learning guarantees for binary classification problems. They show, roughly, that the private and nonprivate sample complexities of PAC learning problems are polynomially related. The algorithms of [KLN+08] run in exponential time in general, though, which makes them difficult to apply. In contrast, our algorithm A_T runs in time essentially proportional to the time needed to compute the original statistic T.
Most relevant to our work is the paper of Dwork and Lei [DL09], which drew inspiration from robust statistics to give differentially private estimators for scale, location and regression problems. Their estimators converge at a rate of n^{-1/2+c} (where c > 0 is a small constant) to the underlying true parameters they were trying to recover. In contrast, our estimators converge at the optimal rate of n^{-1/2} on distributions similar to the ones they analyzed; our convergence rate even recovers the correct leading constant. Our techniques have a similar flavor to those of [DL09]. In particular, we modify a robust estimator of location, the Winsorized mean, to get an efficient¹ differentially private estimator for estimating the mean of a Gaussian distribution. Our analysis sheds some light on the relationship between robustness and differential privacy. Another similarity between this work and [DL09] is the use of the subsample-and-aggregate technique of Nissim, Raskhodnikova and Smith [NRS07] as a basic building block.

¹In statistical theory, an estimator is called efficient if it has the best possible convergence rate (among an appropriate class of estimators); that is, if it is efficient in its use of data. In this paper we have tried to carefully distinguish this meaning of "efficient" from the notion of computational efficiency, but we apologize in advance if some ambiguities remain.

Wasserman and Zhou [WZ10] consider differentially private versions of several nonparametric density estimators. In general, the convergence rates they obtain are suboptimal and the algorithms they describe take exponential time to run. However, the problem they tackle is essentially infinite dimensional; it is unclear whether the techniques we discuss here can be applied in their setting.
Finally, Chaudhuri, Monteleoni and Sarwate [CMS11] and Rubinstein et al. [RBHT] consider differentially private algorithms tailored to algorithms that minimize a convex loss function. Their convergence guarantees are incomparable to ours, although it seems likely that their techniques outperform ours in the specific settings for which they were designed. In particular, the dependency on the dimension of the problem is probably much better with their techniques (our results require the dimension to be bounded above by a small polynomial in n, whereas their techniques may work for dimensions up to √n).

Basic Tools.
We design our differentially private algorithms by composing a number of existing tools from the literature, namely the Laplace mechanism of Dwork, McSherry, Nissim and Smith [DMNS06], the exponential mechanism of McSherry and Talwar [MT07] and the subsample-and-aggregate framework of Nissim, Raskhodnikova and Smith [NRS07]. See Section 2.1 for further discussion.
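As a concrete reference point for the first of these tools, the following is a minimal NumPy sketch of the Laplace mechanism for a query of known sensitivity. The function name `laplace_mechanism` and the example query are illustrative choices made here, not notation from [DMNS06]; the exponential mechanism and subsample-and-aggregate are sketched alongside Algorithms 1–3 below.

```python
import numpy as np

def laplace_mechanism(data, query, sensitivity, eps, rng=None):
    """Release query(data) plus Laplace noise of scale sensitivity/eps.

    If `sensitivity` bounds |query(x) - query(x')| over neighboring data
    sets x, x', the released value is eps-differentially private.
    """
    rng = np.random.default_rng() if rng is None else rng
    return query(data) + rng.laplace(loc=0.0, scale=sensitivity / eps)

# Example: privately release the mean of values known to lie in [0, 1];
# changing one of the n records moves the mean by at most 1/n.
x = np.random.default_rng(0).uniform(size=1000)
private_mean = laplace_mechanism(x, np.mean, sensitivity=1.0 / len(x), eps=0.5)
```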
Preliminaries.
Given a random variable X, we denote its cumulative distribution function by F_X. Similarly, given a probability measure P we use F_P for the corresponding c.d.f. We sometimes abuse notation and equate a random variable X with the corresponding distribution.
To compare probability distributions, we use two metrics. First, given two distributions P, Q defined on the same underlying space of measurable sets, consider the statistical difference (or total variation distance)

    SD(P, Q) = sup_{measurable S} |P(S) − Q(S)| .

One often thinks of the sets S as possible tests for distinguishing P from Q. The definition states that no such test can distinguish between (draws from) P and Q with probability better than 1/2 + SD(P, Q).
Second, for probability distributions on R^d, we will use the Kolmogorov-Smirnov distance, which can be thought of as a relaxed version of the statistical difference that considers only axis-parallel rectangles as possible distinguishers:

    KS(P, Q) = sup_{rectangles R} |P(R) − Q(R)| .

For one-dimensional distributions, this is sup_{a,b ∈ R} |P[a, b] − Q[a, b]|. Note that the KS distance, as defined here, is closely related to the L∞ norm on the cumulative distribution functions of P and Q. Namely, ‖F_P − F_Q‖_∞ ≤ KS(P, Q) ≤ 2^d ‖F_P − F_Q‖_∞, where d is the dimension.
Recall that a sequence of distributions Q_1, Q_2, ... on R^d converges in distribution to Q, denoted Q_n →^D Q, if F_{Q_n}(x) → F_Q(x) as n → ∞ for all points x at which F_Q is continuous (this has several equivalent definitions in terms of expectations of continuous functions). If F_Q is continuous everywhere, then convergence in distribution is equivalent to the condition that KS(Q_n, Q) → 0 as n → ∞. We occasionally abuse notation and write P_n →^D Q_n if all the Q_n have continuous c.d.f.'s and KS(P_n, Q_n) tends to 0 as n grows.
We extend the notions of statistical difference, KS distance and convergence in distribution to random variables in the natural way. For example, we say X_n →^D Y_n if Y_n has a continuous c.d.f. for all n and the KS distance between the distributions of X_n and Y_n tends to 0.
In the sequel, D will denote a generic domain in which data points lie. D* is the set of all finite-length sequences of elements of D. Because order will generally not matter, we will sometimes think of the elements of D* as multisets.
We say two finite multisets (alternatively, sequences) x, x′ ⊆ D are neighbors if they differ in a single element: |x △ x′| = 1. Differential privacy requires that inserting or removing one input tuple should change the distribution of the algorithm's outputs very little, as measured by parameters ε and δ.

Definition 1 ([DMNS06, DKM+06]). Algorithm A is (ε, δ)-differentially private if for all neighboring data sets x, x′ ⊆ D and for all events E,

    Pr(A(x) ∈ E) ≤ e^ε · Pr(A(x′) ∈ E) + δ .

We write ε-differential privacy as shorthand for (ε, 0)-differential privacy.
Most of the algorithms in this paper are ε-differentially private (that is, they satisfy the definition above with δ = 0).
For simplicity, in most of the paper we will assume that the size n of the data set is publicly known. This assumption can be removed at the expense of some technical complication by using a differentially private approximation to n; see Section 2.4.1.
We consider statistics T : D* → R^d that return a vector of d real numbers. Our main algorithm assumes that T in fact takes values only in a bounded cube [−Λ/2, Λ/2]^d, where Λ > 0 is known to the algorithm. This assumption can be removed at the price of achieving only (ε, δ)-differential privacy. See Section 2.4.1 for details.

Asymptotic Normality.
The central limit theorem provides the best-known example of asymptotically normal statistics: any statistic which is a sum of h(X_i) for some function h will converge to a normal distribution as long as h(X_i) has finite expectation and variance. Many other statistics also exhibit this type of convergence, however.
Typically, a careful inspection of the convergence theorems reveals additional properties: the standard deviation of these statistics usually scales as n^{-1/2}; the bias of these statistics, surprisingly, shrinks more quickly, usually at a rate of 1/n; finally, additional moments of the asymptotic distribution can usually be bounded under mild assumptions. We dub statistics that exhibit these additional properties generically asymptotically normal.

Definition 2 (Generic Normality). A statistic T : D* → R is generically asymptotically normal at distribution P if there exist a "true value" T(P) and a constant σ_P > 0 such that if X_1, ..., X_n ~ P i.i.d., then
1. (Normality) (T(X) − T(P)) / (σ_P/√n) →^D N(0, 1) as n → ∞,
2. (Linear bias) E[T(X)] − T(P) = O(1/n), and
3. (Bounded third moment) E( |T(X) − T(P)| / (σ_P/√n) )^3 = O(1).

This definition generalizes to d > 1 by replacing σ_P with a symmetric positive definite matrix Σ_P ∈ R^{d×d}: we require that Z(n) = √n · Σ_P^{-1}(T(X) − T(P)) converges to N(0^d, I) and that the third moment condition holds for the projections along any given direction (that is, for any unit vector u we should have E[|u·Z(n)|^3] = O(1)).

The second and third conditions are necessary because asymptotic normality alone does not imply that functionals of T(X), such as its expectation and variance, also converge to the values one naturally expects. Asymptotic normality implies only that the c.d.f. of T(X) is close, point-wise, to a Gaussian's. However, because the range can be very large, it could be that natural functionals of the distribution differ wildly from those of the Gaussian. For example, consider a distribution Q which is a mixture of N(0, 1) with weight 1 − ρ and N(1/ρ², 1) with weight ρ. The KS distance between Q and N(0, 1) is ρ, but the expectation of Q goes to ∞ as ρ goes to 0.
The thin tails condition implies that this sort of strange behavior does not occur: for "nicely behaved" functions f, one gets that E(f(T(X))) ≈ E(f(Z)), where Z ~ N(T(P), σ_P²/n).
For examples of asymptotically normal statistics, we refer to the textbook of Schervish [Sch96]. Some examples:

• Given a family of distributions parametrized by a finite number of real numbers, {P_θ | θ ∈ Θ ⊆ R^d}: under mild regularity conditions, the maximum likelihood estimator is g.a.n. and has asymptotically optimal variance among all (asymptotically) unbiased estimators. See, for example, [Sch96, §7.3.5]. This includes as a special case mixture models with a fixed number of components, which are described by a finite number of parameters but typically do not have sufficient statistics that are any less compact than the original data set.

• The estimators for common regression problems are also generically asymptotically normal under mild conditions (for example, the data matrix in linear and logistic regression problems should not have very small eigenvalues). In particular, we get estimators for linear and logistic regression coefficients with rate O(n^{-1/2}), improving on the n^{-1/2+γ} (for a small constant γ > 0) convergence rate of the estimators of Dwork and Lei [DL09].

• More generally, many statistics T can be viewed as functionals mapping the space of distributions (data sets are just distributions with finite support and rational probability masses) to R^d. With an appropriate topology on the space of distributions, one can define the differentiability of T at a particular distribution. Differentiable statistics with non-zero derivative at P and satisfying mild moment conditions are generically asymptotically normal. See Fernholz [Fer83] for a compact exposition of the theory.

One striking example where our framework does not apply is the differentially private evaluation of the fit of various models. Many statistics used to evaluate goodness of fit do not have asymptotically normal behavior.

1.1 Main Theorem

Theorem 3 (Main Theorem, Unbounded Range). Given a statistic T : D^n → R^d, there exists an (ε, δ)-differentially private algorithm A = A_{T,ε,δ} such that if T is generically asymptotically normal at distribution P, the random variables A(X) and T(X) converge in distribution when X is an i.i.d. sample of size n from P, that is, KS(A(X), T(X)) → 0 as n → ∞.
Moreover, there is a constant c > 0 such that convergence continues to hold even if d, ε, δ change with n, as long as d, 1/ε and log(1/δ) are all at most n^c.

This theorem in fact follows from a slightly different result, which assumes a known bound on the range of T but achieves a stronger privacy guarantee. The reduction from one form of the main theorem to the other is given in Section 2.4.1.

Theorem 4 (Main Theorem, Bounded Range). Given a statistic T : D^n → [−Λ/2, Λ/2]^d, there exists an (ε, 0)-differentially private algorithm A = A_{T,ε,Λ} such that if T is generically asymptotically normal at distribution P, the random variables A(X) and T(X) converge in distribution when X is an i.i.d. sample from P, that is, KS(A(X), T(X)) → 0 as n → ∞.
Moreover, there is a constant c > 0 such that convergence continues to hold even if d, ε, Λ change with n, as long as d, 1/ε and log(Λ) are all at most n^c.
Because the KS distance bounds the difference between the cumulative distribution functions of two random variables, the theorem implies that A(X) and T(X) converge in distribution: ‖F_{A(X)} − F_{T(X)}‖_∞ → 0 as n → ∞, where F_Y denotes the c.d.f. of the random variable Y.
Note that we provide no bound on the rate at which the KS distance converges to 0 in the theorems above. Such a bound would require further assumptions about T (namely, about how quickly T itself converges to the normal distribution).
We focus here simply on establishing convergence. Note that the theorem is meaningful even with a very slow convergence rate. Once the KS distance is below a small constant, say 1/20, the error of A in estimating T(P) is at most 1.05 times greater than the median error of the nonprivate estimator T.

2. THE ESTIMATORS
In this section we describe the estimators we use and state the main claims about their behavior. Proofs and more detailed analysis are deferred to the next section.

2.1 Subsample and Aggregate
Nissim et al. [NRS07] introduced the subsample-and-aggregate framework for "smoothing" a function f, defined on D*, to obtain a randomized algorithm which is differentially private. When f is sufficiently well behaved, the resulting algorithm outputs a value close to that of f with high probability.
The idea is to randomly partition the input x ∈ D^n into k blocks of size roughly n/k each. The function f is applied to each block to obtain k estimates z_1, ..., z_k. Finally, these estimates are aggregated using a differentially private function B:

    SA_{f,B,k}(x) = B( f(x_1, ..., x_{n/k}), ..., f(x_{(k−1)n/k+1}, ..., x_n) ) ,

as depicted in Figure 1. Nissim et al. observed that this construction is always differentially private, regardless of the structure of f:

[Figure 1 appears here: the input x is split into k blocks, f is applied to each block to produce z_1, ..., z_k, and these are fed to the aggregator B to produce SA(x).]
Figure 1: Subsample-and-aggregate with a generic aggregation algorithm B. We analyze two specific candidates for B: the sample mean and the noisy "widened Winsorized mean".

Lemma 5 (Privacy of SA [NRS07]). Let f be an arbitrary (possibly randomized) function with inputs in D_f* and taking values in a domain D_B. If B is ε-differentially private for inputs in D_B*, then SA_{f,B,k} is ε-differentially private for inputs in D_f*.

The intuition behind the framework is the following: if f gives reasonably consistent answers on data sets of size n/k, then the aggregation function can return a value close to the expected value of f on a random subset of x of size n/k. Roughly, SA_f should obtain the same accuracy as the best nonprivate algorithm could obtain on data sets of a smaller size, n/k.
What we show in this paper is that a careful application of the framework gives answers that are as good, up to a 1 + o(1) factor, as the best nonprivate algorithm can obtain on data sets of the same size.

2.2 A Nonprivate Estimator: Aggregation via Averaging
If we replace the aggregation algorithm B in the subsample-and-aggregate algorithm with a simple average, that is, if we output z̄ = (1/k) Σ_{i=1}^k z_i, then we get an estimator that is not differentially private (in particular, it may be deterministic) but which is asymptotically identical to T(X) when T is asymptotically normal. Because we assume our data is drawn i.i.d., the partitioning into blocks need not be random. We use the natural, consecutive partition for simplicity. We denote the estimator

    Ave_{k,T}(x_1, ..., x_n) = z̄ = (1/k) Σ_{i=1}^k T(x_{(i−1)n/k+1}, ..., x_{in/k}) .

Lemma 6 (Convergence of Averaging). Suppose that T is generically asymptotically normal at P. If k = o(√n), d is bounded above by a sufficiently small polynomial in n, and X_1, ..., X_n are drawn i.i.d. from P, then

    √n · (Ave_{k,T}(X) − T(P)) →^D N(0, σ_P²) as n → ∞.
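To make the nonprivate estimator of this subsection concrete, here is a minimal NumPy sketch of Ave_{k,T}. The function name `ave_kT`, the choice of the median as the example statistic, and the specific setting of k are illustrative choices made here under the assumptions of Lemma 6, not prescriptions from the paper.

```python
import numpy as np

def ave_kT(x, T, k):
    """Nonprivate subsample-and-average estimator Ave_{k,T} (Section 2.2).

    Splits the i.i.d. sample x into k consecutive blocks, applies the
    statistic T to each block, and averages the k block estimates.
    """
    blocks = np.array_split(np.asarray(x), k)   # consecutive blocks of size ~ n/k
    z = np.array([T(block) for block in blocks])
    return z.mean(axis=0)

# Example: T is the sample median, data is i.i.d. Gaussian.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=10_000)
k = int(np.sqrt(len(x)) / np.log(len(x)))       # k = o(sqrt(n)), as in Lemma 6
print(ave_kT(x, np.median, k))
```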
Proof Idea: Separating Bias and Variance.
The basic intuition for the analysis is that the averaging step reduces the variance drastically, but does not reduce bias. As long as the bias of the individual estimators is sufficiently low, however, we can still get convergence to the right distribution.
First, since the sample X is i.i.d. from P, the Z_i are themselves i.i.d. and (by asymptotic normality) each Z_i is close in distribution to N(T(P), kσ_P²/n). Crudely, one would then expect that the average of the Z_i's is also close to Gaussian with the same mean T(P) and variance reduced by a factor of k, down to σ_P²/n. This doesn't quite work, but it comes close. The variance calculation is essentially correct: one can show that the variance of each Z_i is close to kσ_P²/n, and since the Z_i's are independent, the variance of the average Z̄ is exactly 1/k times the variance of each of the terms, which gives

    Var(Z̄) ≈ (1/k) · (kσ_P²/n) = σ_P²/n .

On the other hand, the expectation of Z̄ is exactly the expectation of the Z_i's, without division by k. The linear bias condition of generic asymptotic normality ensures that E(Z_i) = T(P) ± O(k/n). When k is asymptotically smaller than √n, we get E(Z̄) = T(P) ± o(n^{-1/2}). This is not exactly the right expectation, but the o(n^{-1/2}) bias term is swamped by the variance and so we get Z̄ ≈ N(T(P), σ_P²/n).
To make this argument formal, we use the bound on the third moment of the Z_i's to convert between expectation/variance calculations and convergence in distribution. The details are explained in Section 3.1.

Relation to the Bootstrap and Jackknife.
The "subsample-and-average" technique above belongs to a family of resampling techniques (of which popular members are the bootstrap and jackknife estimators) used in statistics to estimate the sampling distribution of a statistic. The textbook of Politis, Romano and Wolf [PRW99] describes the general theory of these techniques (see Chapter 2, for example, for the treatment of the i.i.d. setting). The specific variant we use here, in which each observation gets used in exactly one subsample, is not standard. More importantly, to our knowledge our analysis of the bias of the averaging estimator, and the extrapolation to asymptotic convergence, is novel.

2.3 Corollary: Streaming Estimators
Note that Ave_{k,T} can be evaluated in a single pass over the data set using memory at most n/k plus whatever memory is needed to evaluate T on inputs of size n/k: one need only remember the values x_j for the current block and a running sum of the z_i's computed so far. Setting k to be slightly smaller than √n, we obtain:

Theorem 7 (Streaming statistical estimators). Let T : D* → R^d be a statistic that is computable in linear space and polynomial time. Then T̃(X) = Ave_{√n/log n, T}(X) can be computed in one pass with Õ(√n) space. If T is g.a.n. at P and X is an i.i.d. sample from P of size n, then T̃(X) →^D T(X) as n → ∞.

To get some idea of the meaning of this result, consider the naive low-space estimator which estimates T(P) by computing T on a single subsample of size √n from the data set. The standard deviation of this estimator will be on the order of n^{-1/4}. In contrast, Theorem 7 gives an estimator with standard deviation O(n^{-1/2}).
This result has a quite different flavor from existing work on low-space statistical estimators, since it aims for optimal error and allows for a polynomial amount of space. As mentioned in the introduction, previous work [GM07, CK09, CLM10] focused on regimes with higher error and lower (generally polylogarithmic in n) space. An interesting direction, which we leave to future work, is understanding whether our analysis and techniques could also lead to improvements in lower space regimes.
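The one-pass evaluation behind Theorem 7 can be sketched as follows. This is only an assumed implementation outline (the function name and structure are ours): memory is dominated by the current block plus a running total, which makes the Õ(√n) space claim concrete when the block size is about √n·log n.

```python
import numpy as np

def streaming_ave_T(stream, T, block_size):
    """One-pass evaluation of Ave_{k,T}: keep only the current block and a
    running sum of per-block statistics (Theorem 7 uses blocks of size
    roughly sqrt(n) * log n, i.e. k ~ sqrt(n)/log n)."""
    block, total, count = [], 0.0, 0
    for value in stream:
        block.append(value)
        if len(block) == block_size:
            total += T(np.asarray(block))
            count += 1
            block = []                 # memory stays O(block_size)
    if block:                          # handle a partial final block, if any
        total += T(np.asarray(block))
        count += 1
    return total / count
```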

2.4 Making the Estimator Private: Widened Winsorized Mean
We can make the estimator above differentially private by replacing the empirical average with a differentially private aggregate W(z_1, ..., z_k) which closely approximates the average as long as its input z is an i.i.d. sample from a distribution close to the normal distribution with bounded third moment.
A Winsorized mean rounds the observations in a data set to lie within a fixed interval, usually defined in terms of some quantiles of the data set. An α-Winsorized mean rounds the αk smallest values up to Z_(αk) and rounds the αk largest values down to Z_((1−α)k). For example, when α = 1/4 one "squeezes" the data set to lie in its own interquartile range.
In order to get both differential privacy and statistical efficiency, we adapt this idea in two key ways:
First, we replace the true quartiles with differentially private estimates. There are several ways to estimate quantiles; we found an "exponential-mechanism"-based method (due to McSherry and Talwar [MT07], as well as personal communication) to be the most amenable to our analysis (the "smoothed sensitivity" approach of Nissim et al. [NRS07] and the "propose-test-release" approach of Dwork and Lei [DL09] also work here but make for messier theorem statements). For completeness, PrivateQuantile is described in Algorithm 2.
Second, we widen the Winsorization interval: we scale the interval up by a factor rad (a parameter we set to be roughly k^{1/3}). This scaling allows us to capture enough values in the tail to get optimal variance, but confines the data to a small enough interval that we can add a relatively small amount of noise and still get differential privacy.
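For intuition about the nonprivate primitive being adapted here, the following two-step NumPy sketch computes a plain α-Winsorized mean by clamping the data to its own empirical quantiles and then averaging; the function name and the clipping-based formulation are illustrative choices of ours.

```python
import numpy as np

def winsorized_mean(z, alpha=0.25):
    """Nonprivate alpha-Winsorized mean: clamp the data to its empirical
    alpha and 1-alpha quantiles, then average."""
    z = np.asarray(z, dtype=float)
    lo, hi = np.quantile(z, [alpha, 1.0 - alpha])
    return np.clip(z, lo, hi).mean()
```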
The resulting algorithm W is described in Algorithm 1. Note that the privacy guarantee does not make any distributional assumptions about the input.

2.4.1 Simplifying Assumptions
In the remainder of the analysis, we make two simplifying assumptions.
First, we assume that the size of the database, n, is a publicly known value. In general, though, it may not be desirable to release n; we can instead first estimate n using the Laplace mechanism [DMNS06] and then use that estimate in place of n in the remaining algorithms. The data set can then be made to have length equal to this estimate, either by inserting a small number of default values or by removing a small number of values. This will not change the asymptotics of the estimators analyzed here.
Second, we assume that the estimator T always takes values in a known range [−Λ/2, Λ/2]^d. One can circumvent this in at least two ways. The first is to monotonically rescale each of the real axes to take values only in (−1/2, 1/2) (say via the map x ↦ (1/π) arctan(x)). However, this has the net effect of drastically increasing the error for parameter values that are far away from the origin. Another approach is to use the differentially private estimators of scale and location due to Dwork and Lei [DL09] inside W to obtain a crude interval that contains a (1 − o(1)) fraction of the data set and which is not too much larger than σ_P. Because the final error depends only logarithmically on the radius Λ of the interval, even a significant overestimate of the range of the data will not affect the final answer significantly. We defer the details to the full version of the paper.

Analysis of WWM.

Lemma 8. The algorithm W is ε-differentially private.

Proof. The algorithm is a composition of three differentially private mechanisms: two calls to PrivateQuantile (with parameter ε/4 each) and one application of the Laplace mechanism (with parameter ε/2). Overall, the algorithm is ε-differentially private by the triangle inequality.

We now turn to the more involved analysis of the efficiency of W(Z) as an estimator of the mean.
Lemma 9. Let W be the noisy widened Winsorized mean described in Algorithm 1. Let μ̂(Z) be the initial estimate of Z̄ that is computed by the algorithm, and let Y be the Laplace noise that is added to μ̂(Z). There exists a constant β > 0 such that for every random variable Z with KS(Z, N(μ, σ²)) ≤ 1/20 and E(|Z − μ|/σ)³ ≤ C, if Z_1, ..., Z_k are i.i.d. copies of Z, then the following holds: there exists an event E of probability at least 1 − Ck^{-β} − Λe^{-Ω(εk)} such that, conditioned on E,
1. μ̂(Z) = Z̄, and
2. Var(Y) ≤ Var(Z̄) · k^{-β}.

The event E in the preceding lemma is roughly that the quantile estimates obtained in the first part of W are accurate. In that case, the interval [ℓ, u] into which observations are projected is likely to contain all the points in Z; this latter event implies μ̂(Z) = Z̄.
Since μ̂(Z) and Z̄ are actually equal with high probability, their distributions are close in statistical difference. We immediately get that KS(μ̂, Z̄) is small, since statistical difference upper bounds KS distance. Second, since under the same conditions we know that Z̄ is close to normal, adding noise Y with variance much smaller than that of Z̄ will not change the distribution significantly. We obtain the following key corollary:

Corollary 10 (Accuracy of WWM). Let W be the noisy widened Winsorized mean described in Algorithm 1. There exists a constant β > 0 such that for every random variable Z with KS(Z, N(μ, σ²)) ≤ 1/20 and E(|Z − μ|/σ)³ ≤ C, if Z_1, ..., Z_k are i.i.d. copies of Z, then

    KS(W(Z), Z̄) ≤ Ck^{-β} + Λe^{-Ω(εk)} .

In particular, if k goes to infinity and C, 1/ε and log(Λ) are bounded above by sufficiently small polynomials in k, then KS(W(Z), Z̄) goes to 0.

2.5 Putting the Pieces Together
We can combine the guarantees on W with the fact, shown above, that Z̄ = Ave_{k,T}(X) is asymptotically equivalent to T(X) (when T is generically asymptotically normal at the distribution of X). The overall estimator A_T is described in Algorithm 3.
To get the final result, we must set k carefully in A_T to balance two factors: if k is too large, then the bias of the estimators on the individual subsamples becomes too big and dominates the variance of the estimator. If k is too small, however, the noise added to ensure the privacy of W becomes too large, and again the overall error becomes too large. We set k = n^{1/2−η}, where η > 0 is a small constant. This allows us to prove the bounded-range version of the main result.

Proof of Theorem 4. We have set k = n^{1/2−η} so that the analysis of the convergence of simple averaging applies. We get that KS(Z̄, N(T(P), σ_P²/n)) tends to 0 (by Lemma 6) and that KS(W(Z), Z̄) tends to 0 (by Lemma 9). Since the KS distance satisfies the triangle inequality, we have that KS(W(Z), N(T(P), σ_P²/n)) ≤ KS(N(T(P), σ_P²/n), Z̄) + KS(Z̄, W(Z)) tends to 0 as n grows to ∞. The same argument works in higher dimensions by a straightforward hybrid argument, although the distance bounds degrade by a factor of d.

As mentioned earlier, the unbounded-range version of the result follows by first obtaining a crude estimate of Λ using the technique of Dwork and Lei [DL09].

3. DETAILED ANALYSIS

3.1 Analysis of Simple Averaging
Proof of Lemma 6. Consider first what happens when d = 1. Let ρ_n = KS(T(X), N(T(P), σ_P²/n)). By the normality assumption, ρ_n → 0.
By the linear bias assumption, E(Z̄) = E(Z_1) = T(P) + O(k/n).

Algorithm 1: Widened Winsorized Mean W
Input: Z = (Z_1, ..., Z_k) ∈ R^k, privacy parameter ε > 0, bounding parameter Λ > 0
Output: Estimate μ̂ = W(Z)
Set the parameter rad = k^{1/3 + η}, where η = 1/10.
/* First, estimate the range of Z with [ℓ, u]. */
/* Use the exponential mechanism to estimate quantiles. */
â ← PrivateQuantile(Z, 1/4, ε/4, Λ);
b̂ ← PrivateQuantile(Z, 3/4, ε/4, Λ);
μ_crude ← (â + b̂)/2;  iqr_crude ← |b̂ − â|;
u ← μ_crude + 4 · rad · iqr_crude;  ℓ ← μ_crude − 4 · rad · iqr_crude;
/* Now compute the Winsorized mean for the range [ℓ, u]. */
Define Π_{ℓ,u}(x) = ℓ if x ≤ ℓ;  x if ℓ < x ≤ u;  u if x > u.
Let μ̂ ← (1/k) Σ_{i=1}^k Π_{ℓ,u}(Z_i);
Sample Y ~ Lap((u − ℓ)/(2εk)), where Lap(λ) is a Laplace distribution with scale parameter λ;
Output W(Z) = μ̂ + Y.
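The following NumPy sketch mirrors Algorithm 1 step by step. It is a sketch under the stated assumptions, not a vetted implementation: the function name is ours, and it calls a `private_quantile` helper sketched after Algorithm 2 below.

```python
import numpy as np

def widened_winsorized_mean(z, eps, lam, eta=0.1, rng=None):
    """Sketch of Algorithm 1 (noisy widened Winsorized mean W).

    Gets rough quartiles with private_quantile (eps/4 each), widens the
    interval by rad = k**(1/3 + eta), clamps the data to [l, u], averages,
    and adds Laplace noise of scale (u - l)/(2 * eps * k)  (eps/2 budget).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    k = len(z)
    rad = k ** (1.0 / 3.0 + eta)

    a_hat = private_quantile(z, 0.25, eps / 4.0, lam, rng)
    b_hat = private_quantile(z, 0.75, eps / 4.0, lam, rng)
    mu_crude = (a_hat + b_hat) / 2.0
    iqr_crude = abs(b_hat - a_hat)
    u = mu_crude + 4.0 * rad * iqr_crude
    l = mu_crude - 4.0 * rad * iqr_crude

    mu_hat = np.clip(z, l, u).mean()            # Winsorized mean over [l, u]
    noise = rng.laplace(scale=(u - l) / (2.0 * eps * k))
    return mu_hat + noise
```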

Algorithm 2: PrivateQuantile(Z, α, ε, Λ)

Input: List of real numbers Z = (Z_1, ..., Z_k), quantile α ∈ (0, 1), privacy parameter ε > 0, bounding parameter Λ > 0
Output: A real number x ∈ [0, Λ] drawn according to the distribution with density proportional to exp(−(ε/2)·|αk − #{i : Z_i ≤ x}|)
Sort the Z_i in ascending order; replace each Z_i < 0 with 0 and each Z_i > Λ with Λ;
Define Z_0 = 0 and Z_{k+1} = Λ;
For i = 0, ..., k, set y_i = (Z_{i+1} − Z_i) · exp(−ε|i − αk|);
Sample an integer i ∈ {0, ..., k} with probability y_i / (Σ_{j=0}^k y_j);
Output a uniform draw from the interval [Z_i, Z_{i+1}].
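A minimal NumPy sketch of PrivateQuantile follows. It assumes the interval-weighting form of the exponential mechanism given in Algorithm 2; the function name, the use of a NumPy generator, and the max-subtraction for numerical stability are our choices, not part of the paper.

```python
import numpy as np

def private_quantile(z, alpha, eps, lam, rng=None):
    """Sketch of Algorithm 2: exponential-mechanism estimate of the
    alpha-quantile of z, restricted to [0, lam].

    Interval (Z_i, Z_{i+1}) gets weight proportional to its length times
    exp(-eps * |i - alpha*k|); a point is drawn uniformly from the chosen
    interval.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.clip(np.sort(np.asarray(z, dtype=float)), 0.0, lam)
    k = len(z)
    grid = np.concatenate(([0.0], z, [lam]))     # Z_0 = 0, Z_{k+1} = lam
    lengths = np.diff(grid)                      # Z_{i+1} - Z_i for i = 0..k
    scores = -eps * np.abs(np.arange(k + 1) - alpha * k)
    # Subtracting the max score before exponentiating avoids underflow and
    # does not change the sampling probabilities.
    weights = lengths * np.exp(scores - scores.max())
    probs = weights / weights.sum()
    i = rng.choice(k + 1, p=probs)
    return rng.uniform(grid[i], grid[i + 1])
```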

Computing the variance is slightly more delicate since, even though the Z_i's are close to N(T(P), kσ_P²/n), the variance of Z_i may differ from kσ_P²/n. Nevertheless, the bound on the third moment of the Z_i's (which is ensured by the generic asymptotic normality assumption) allows us to conclude that the variance is indeed close to what we expect. Lemma 14 in Appendix A implies that

    Var(Z̄) = Var(Z_1)/k ≤ (1/k) · (σ_P²/(n/k)) · (1 + O(ρ_{n/k}^{1/3})) = (1 + o(1)) · σ_P²/n .

Now Z̄ is an average of i.i.d. random variables. The central limit theorem states that such averages converge to a normal distribution. The difficulty here is that we are taking two limits at the same time: the distribution of the Z_i's also changes with k. Fortunately, the Berry-Esséen theorem gives a uniform bound on the convergence rate in terms of the third moment of the Z_i's (which is bounded by the conditions of generic asymptotic normality). The following statement is paraphrased from Feller [Fel71, §16.5]:

Theorem 11 (Berry-Esséen). Let A_1, ..., A_k be i.i.d. realizations of a random variable A with mean 0, variance 1 and third absolute moment c = E[|A|³] < ∞, and let G be a N(0, 1) Gaussian random variable. Then KS(√k · Ā, G) ≤ 9c/√k.

Applying the Berry-Esséen theorem to the variables (Z_i − E(Z_1))/√Var(Z_1), we get that

    Y'_n = (Z̄ − E(Z_1)) / √(Var(Z_1)/k) = (Z̄ − E(Z_1)) / ((1 + o(1)) · σ_P/√n) →^D N(0, 1) .

Dropping the 1 + o(1) term in the denominator, we get that Y_n = (Z̄ − E(Z_1))/(σ_P/√n) also converges to N(0, 1) (since Y_n = (1 + o(1)) Y'_n, for any real value y we have Pr(Y_n ≤ y) = Pr(Y'_n ≤ y(1 + o(1))), which converges to the same limit as Pr(Y'_n ≤ y) since Y'_n converges to a continuous distribution). Finally, we can replace E(Z_1) with T(P) ± O(k/n): doing so changes Y_n by at most O(k/n)/(σ_P/√n) = O(k/√n)/σ_P, which is o(1) since k = o(√n). Hence

    (Z̄ − T(P)) / (σ_P/√n) →^D N(0, 1) ,

as desired. The proof extends directly to higher d using higher-dimensional analogues of the central limit theorem and Berry-Esséen bounds.

Algorithm 3: A_T: Subsample-and-aggregate using the Widened Winsorized Mean
Input: X = (X_1, ..., X_n) in domain D^n; description of T : D^n → R^d
Output: Estimate A(X) of T(X)
Set k = n^{1/2 − η}, where η = 1/10.
Randomly divide X into k blocks X^{(1)}, ..., X^{(k)} of size n/k each;
Compute Z_i = T(X^{(i)}) for each block i = 1, 2, ..., k;
for each dimension j = 1, ..., d do
    /* Run the noisy Widened Winsorized Mean in dimension j */
    Z|_j ← projection of Z = (Z_1, ..., Z_k) onto dimension j;
    A_j ← W(Z|_j, ε/d);

Output (A1,...,Ad).
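Putting the pieces together, here is a NumPy sketch of Algorithm 3 built on the `widened_winsorized_mean` sketch above. It is an illustration under our own bookkeeping choices: in particular, since the PrivateQuantile sketch works over [0, Λ] while T is assumed to take values in [−Λ/2, Λ/2]^d, each coordinate is shifted by Λ/2 before aggregation and shifted back afterwards.

```python
import numpy as np

def private_estimator_AT(x, T, eps, lam, eta=0.1, rng=None):
    """Sketch of Algorithm 3 (A_T): subsample-and-aggregate with the noisy
    widened Winsorized mean as the per-coordinate aggregator.

    Assumes T maps a block of samples to a d-dimensional vector with entries
    in [-lam/2, lam/2]; each coordinate gets privacy budget eps/d.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    k = int(round(n ** (0.5 - eta)))
    blocks = np.array_split(x[rng.permutation(n)], k)   # random partition
    z = np.vstack([np.atleast_1d(T(block)) for block in blocks])  # (k, d)
    d = z.shape[1]
    return np.array([
        widened_winsorized_mean(z[:, j] + lam / 2.0, eps / d, lam, eta, rng)
        - lam / 2.0
        for j in range(d)
    ])
```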

3.2 Analysis of Widened Winsorized Mean
The analysis of W hinges on getting good estimates of the quartiles. The following lemma shows the estimates are accurate enough (that is, the interquartile range is correct within a constant factor) with high probability:

Lemma 12. Let Z = (Z_1, ..., Z_k) be i.i.d. draws from distribution Q. If Q is within KS distance 1/20 of N(μ, σ²), then, with probability 1 − (Λ/σ)e^{-Ω(εk)} over the choice of Z, both μ − â and b̂ − μ are in the interval [σ/8, 2σ].

This lemma is proved in the next section. For now, we need only the following corollary:

Corollary 13. If we define ℓ, u as μ_crude ± 4(b̂ − â)·rad, then the probability over the choice of Z that both [ℓ, u] contains μ ± σ·rad and |u − ℓ| < 16σ·rad is 1 − (Λ/σ)e^{-Ω(εk)}.

We can now prove the main lemma on the accuracy of the noisy widened Winsorized mean.

Proof of Lemma 9. Let E_1 be the event described in Corollary 13. Conditioned on E_1, the standard deviation of Y is O(σ·rad/(εk)). Since we set rad = k^{1/3 + η}, the standard deviation of Y is O(σ k^{-2/3 + η}), which is much smaller than √Var(Z̄) = (σ/√k)(1 + o(1)) (see the analysis of the simple averaging estimator for the bound on the variance of Z̄).
Let E_2 be the event that all the points in Z lie in the interval μ ± σ·rad. The bound on the third moment of Z implies that

    Pr( |Z_1 − μ|/σ > rad ) = ∫_{|w| > rad} dQ(μ + σw) ≤ rad^{-3} ∫_{|w| > rad} |w|³ dQ(μ + σw) ≤ C · rad^{-3} ,

hence the probability that E_2 fails is at most kC · rad^{-3} = Ck^{-3η}.
To complete the proof of the lemma, let E = E_1 ∩ E_2. The probability of E is close to 1 since each of E_1 and E_2 occurs with probability close to 1. Moreover, E_2 and E_1 together imply that all the points in Z lie in [ℓ, u], and so μ̂(Z) = Z̄. The event E_1 alone implies that the Laplace noise added to the estimate is not too large, as desired.

3.3 Analysis of PrivateQuantile
Proof of Lemma 12. The constants 1/8 and 2 are chosen to be less than F^{-1}(11/20) ≈ 0.126 and more than F^{-1}(19/20) ≈ 1.65, respectively, where F is the c.d.f. of N(0, 1). We analyze here the case of the upper estimate b̂; the case of the lower estimate â is symmetric.
First, we can assume that μ = 0 and σ = 1 by rescaling the interval [0, Λ] to [−μ/σ, (Λ − μ)/σ].
Second, we can assume that the empirical c.d.f. of Z_1, ..., Z_k, denoted F̂_Z, follows F_Z closely. By the Dvoretzky-Kiefer-Wolfowitz theorem, the probability that ‖F̂_Z − F_Z‖_∞ > β is exponentially small in β²k. We will assume the c.d.f.'s differ by at most β = 1/1000; this occurs with probability 1 − exp(−Ω(k)). Thus, the overall distance between F̂_Z and F is at most ρ + β.
Our argument follows the by-now standard outline of analyses of the exponential mechanism [MT07]. Divide the real line into segments delimited by F^{-1}(11/20), F^{-1}(14/20), F^{-1}(16/20), and F^{-1}(19/20). Let BAD be the set of points outside of [F^{-1}(11/20), F^{-1}(19/20)] and let GOOD be the set of points inside [F^{-1}(14/20), F^{-1}(16/20)].
Consider the distribution sampled from by PrivateQuantile. Points in BAD have density in the distribution at most exp(−(ε/2)(4/20 − ρ − β)k)/M, where M is a normalizing constant. Points in GOOD have density at least exp(−(ε/2)(1/20 + ρ + β)k)/M. Moreover, the measure of BAD is at most Λ/σ; the measure of GOOD is F^{-1}(16/20) − F^{-1}(14/20). Thus, the probability of a point in BAD being sampled by PrivateQuantile is at most

    [ (Λ/σ) · exp(−(ε/2)(4/20 − ρ − β)k) ] / [ (F^{-1}(16/20) − F^{-1}(14/20)) · exp(−(ε/2)(1/20 + ρ + β)k) ]
        = O(Λ/σ) · exp(−(ε/2)(3/20 − 2ρ − 2β)k) .

Recalling that ρ < 1/20 and β < 1/1000, we get the desired bound.

Acknowledgements
This work benefited from conversations with many colleagues from both statistics and computer science over the last three years. In particular, I am grateful for helpful discussions with Cynthia Dwork, Sofya Raskhodnikova, Steve Fienberg, Sesa Slavković, Bing Li, Jing Lei, Andrew McGregor and Larry Wasserman.

4. REFERENCES
[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128–138. ACM, 2005.
[CK09] Kevin L. Chang and Ravi Kannan. Pass-efficient algorithms for learning mixtures of uniform distributions. SIAM J. Comput., 39(3):783–812, 2009.
[CLM10] Steve Chien, Katrina Ligett, and Andrew McGregor. Space-efficient estimation of robust statistics and distribution testing. In Innovations in Computer Science (ICS), pages 251–265, 2010.
[CMS11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011.
[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, LNCS, pages 486–503. Springer, 2006.
[DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC '09: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009.
[Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998. Preliminary version in proceedings of STOC '93.
[KLN+08] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? In FOCS, pages 531–540. IEEE Computer Society, 2008.
[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103. IEEE, 2007.
[NRS07] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75–84. ACM, 2007.
[PRW99] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer-Verlag, 1999.
[RBHT] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. arXiv:0911.5708v1 [cs.LG].
[Sch96] Mark Schervish. Theory of Statistics. Springer, 1996.
[Smi08] Adam Smith. Efficient, differentially private point estimators. CoRR, abs/0809.4794, 2008.
[WZ10] Larry Wasserman and Shuheng Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, March 2010. arXiv:0811.2501v1 [math.ST].

APPENDIX
A. VARIANCE BOUNDS

Lemma 14. Let Z be drawn from a distribution within KS distance ρ of N(μ, σ²) and such that E(|Z − μ|/σ)^s ≤ C for some s > 2 and C ≥ 2√(2π). Then Var(Z) = σ²(1 ± 6C^{2/s} ρ^{(s−2)/s}). In particular, when s = 3 and C is constant, we get Var(Z) = σ²(1 ± O(ρ^{1/3})).

Lemma 15 (Writing expectations using CDFs). Let f : R → R be differentiable on [0, ∞] such that f(0) = 0, and let Z be a nonnegative random variable. Then

    E_Q(f(Z)) = ∫_{t=0}^∞ f'(t) Q(Z ≥ t) dt ,

as long as the integral on the right converges absolutely (that is, as long as ∫_{t=0}^∞ |f'(t)| Q(Z ≥ t) dt < ∞).
Note that if Q has bounded support and f is bounded on the support of Q, then the integral on the right-hand side will always converge absolutely.

Proof of Lemma 14. Let Q be the distribution of Z. Without loss of generality, we can assume μ = 0 and σ = 1; the general result follows by rescaling. Then Var(Z) ≤ E_Q(Z²) = ∫_{z=−∞}^∞ z² dQ(z). We can break this integral into two pieces and write it as ∫_{w=0}^∞ w² dQ(−w) + ∫_{w=0}^∞ w² dQ(w). We will show that each of these pieces is very close to σ²/2, which is the value of the analogous integral for a standard N(0, 1) Gaussian random variable. Consider the component w > 0; the corresponding argument for w < 0 is symmetric. Let N be the distribution of the Gaussian:

    ∫_{w=0}^∞ w² dQ(w) − σ²/2 = ∫_{w=0}^∞ w² dQ(w) − ∫_{w=0}^∞ w² dN(w) .

Given a bound T > 0, which we will set later, we can break both integrals into two pieces, one for w < T and one for w ≥ T. The difference between the pieces for w < T is at most ρT² (applying Lemma 15 and using the fact that the c.d.f.'s of Q and N differ by at most ρ pointwise). The piece for w ≥ T is bounded using the moment condition:

    ∫_{w=T}^∞ w² dQ(w) ≤ T^{-(s−2)} ∫_{w=T}^∞ w^s dQ(w) ≤ C T^{-(s−2)} .

Finally, the same bound applies to the analogous integral for the normal, as long as C is at least as large as the third (absolute) moment of the normal, which is 2√(2π).
Putting the pieces together, and multiplying by two to account for the contribution of w < 0, we get

    |Var(Z) − σ²| ≤ 2(ρT² + 2C T^{-(s−2)}) .

Setting T = (C/ρ)^{1/s}, we get that |Var(Z) − σ²| ≤ 6 C^{2/s} ρ^{(s−2)/s}, as desired.