Differential Privacy at Risk: Bridging Randomness and Privacy Budget
Ashish Dandekar, Debabrota Basu, Stéphane Bressan

To cite this version:

Ashish Dandekar, Debabrota Basu, Stéphane Bressan. Differential Privacy at Risk: Bridging Randomness and Privacy Budget. Proceedings on Privacy Enhancing Technologies, De Gruyter Open, 2021, 2021 (1), pp. 64-84. DOI: 10.2478/popets-2021-0005. hal-02942997

HAL Id: hal-02942997 https://hal.inria.fr/hal-02942997 Submitted on 18 Sep 2020


Ashish Dandekar*, Debabrota Basu*, and Stéphane Bressan Differential Privacy at Risk: Bridging Randomness and Privacy Budget

Abstract: The calibration of noise for a privacy-preserving mechanism depends on the sensitivity of the query and the prescribed privacy level. A data steward must make the non-trivial choice of a privacy level that balances the requirements of users and the monetary constraints of the business entity.
Firstly, we analyse the roles of the sources of randomness, namely the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution, that are involved in the design of a privacy-preserving mechanism. This finer analysis enables us to provide stronger privacy guarantees with quantifiable risks. Thus, we propose privacy at risk, a probabilistic calibration of privacy-preserving mechanisms. We provide a composition theorem that leverages privacy at risk. We instantiate the probabilistic calibration for the Laplace mechanism by providing analytical results.
Secondly, we propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR-compliant business entity. The convexity of the proposed cost model leads to a unique fine-tuning of the privacy level that minimises the compensation budget. We show its effectiveness by illustrating a realistic scenario that avoids overestimation of the compensation budget by using privacy at risk for the Laplace mechanism. We quantitatively show that composition using the cost-optimal privacy at risk provides a stronger privacy guarantee than the classical advanced composition. Although the illustration is specific to the chosen cost model, it naturally extends to any convex cost model. We also provide realistic illustrations of how a data steward uses privacy at risk to balance the trade-off between utility and privacy.

Keywords: Differential privacy, cost model, Laplace mechanism

*Corresponding Author: Ashish Dandekar: DI ENS, ENS, CNRS, PSL University & Inria, Paris, France, E-mail: [email protected]
*Corresponding Author: Debabrota Basu: Dept. of Computer Sci. and Engg., Chalmers University of Technology, Göteborg, Sweden, E-mail: [email protected]
Stéphane Bressan: National University of Singapore, Singapore, E-mail: [email protected]

1 Introduction

Dwork et al. [12] quantify the privacy level ε in ε-differential privacy (or ε-DP) as an upper bound on the worst-case privacy loss incurred by a privacy-preserving mechanism. Generally, a privacy-preserving mechanism perturbs the results by adding a calibrated amount of random noise to them. The calibration of the noise depends on the sensitivity of the query and the specified privacy level. In a real-world setting, a data steward must specify a privacy level that balances the requirements of the users and the monetary constraints of the business entity. For example, Garfinkel et al. [14] report on issues encountered when deploying differential privacy as the privacy definition by the US Census Bureau. They highlight the lack of analytical methods to choose the privacy level. They also report empirical studies that show the loss in utility due to the application of privacy-preserving mechanisms.
We address the dilemma of a data steward in two ways. Firstly, we propose a probabilistic quantification of privacy levels. The probabilistic quantification provides a data steward with a way to take quantified risks under the desired utility of the data. We refer to this probabilistic quantification as privacy at risk. We also derive a composition theorem that leverages privacy at risk. Secondly, we propose a cost model that links the privacy level to a monetary budget. This cost model helps the data steward to choose the privacy level constrained on the estimated budget and vice versa. Convexity of the proposed cost model ensures the existence of a unique privacy at risk that minimises the budget. We show that the composition with an optimal privacy at risk provides stronger privacy guarantees than the traditional advanced composition [12]. In the end, we illustrate a realistic scenario that exemplifies how the data steward can avoid overestimation of the budget by using the proposed cost model together with privacy at risk.


The probabilistic quantification of privacy levels depends on two sources of randomness: the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. Often, these two sources are coupled with each other. We require analytical forms of both sources of randomness, as well as an analytical representation of the query, to derive a privacy guarantee. Computing the probabilistic quantification for different sources of randomness is generally a challenging task. Although we find multiple probabilistic privacy definitions in the literature [16, 27]¹, we miss an analytical quantification bridging the randomness and the privacy level of a privacy-preserving mechanism. We propose a probabilistic quantification, namely privacy at risk, that further leads to an analytical relation between privacy and randomness. We derive a composition theorem with privacy at risk for mechanisms with the same as well as varying privacy levels. It is an extension of the advanced composition theorem [12] that deals with a sequential and adaptive use of privacy-preserving mechanisms. We also prove that privacy at risk satisfies convexity over privacy levels and a weak relaxation of the post-processing property. To the best of our knowledge, we are the first to analytically derive the proposed probabilistic quantification for the widely used Laplace mechanism [10].
The privacy level proposed by the differential privacy framework is too abstract a quantity to be integrated in a business setting. We propose a cost model that maps the privacy level to a monetary budget. The proposed model is a convex function of the privacy level, which further leads to a convex cost model for privacy at risk. Hence, it has a unique probabilistic privacy level that minimises the cost. We illustrate this using a realistic scenario in a GDPR-compliant business entity that needs an estimation of the compensation budget that it has to pay to stakeholders in the unfortunate event of a personal data breach. The illustration, which uses the proposed convex cost model, shows that the use of probabilistic privacy levels avoids overestimation of the compensation budget without sacrificing utility. The illustration naturally extends to any convex cost model.
In this work, we comparatively evaluate the privacy guarantees obtained using privacy at risk of the Laplace mechanism. We quantitatively compare the composition under the optimal privacy at risk, which is estimated using the cost model, with traditional composition mechanisms – basic and advanced composition [12]. We observe that it gives stronger privacy guarantees than the ones obtained by the advanced composition without sacrificing the utility of the mechanism.
In conclusion, the benefits of the probabilistic quantification, i.e., of privacy at risk, are twofold. It not only quantifies the privacy level for a given privacy-preserving mechanism but also facilitates decision-making in problems that focus on the privacy-utility trade-off and the compensation budget minimisation.

¹ The widely used (ε, δ)-differential privacy is not a probabilistic relaxation of differential privacy [29].

2 Background

We consider a universe of datasets D. We explicitly mention when we consider that the datasets are sampled from a data-generation distribution G with support D. Two datasets of equal cardinality x and y are said to be neighbouring datasets if they differ in one data point. A pair of neighbouring datasets is denoted by x ∼ y. In this work, we focus on a specific class of queries called numeric queries. A numeric query f is a function that maps a dataset into a real-valued vector, i.e. f : D → R^k. For instance, a sum query returns the sum of the values in a dataset.
In order to achieve a privacy guarantee, researchers use a privacy-preserving mechanism, or mechanism in short, which is a randomised algorithm that adds noise to the query from a given family of distributions. Thus, a privacy-preserving mechanism of a given family, M(f, Θ), for the query f and the set of parameters Θ of the given noise distribution, is a function M(f, Θ) : D → R. In the case of numeric queries, R is R^k. We denote a privacy-preserving mechanism by M when the query and the parameters are clear from the context.

Definition 1 (Differential Privacy [12]). A privacy-preserving mechanism M, equipped with a query f and with parameters Θ, is (ε, δ)-differentially private if for all Z ⊆ Range(M) and x, y ∈ D such that x ∼ y:
P(M(f, Θ)(x) ∈ Z) ≤ e^ε × P(M(f, Θ)(y) ∈ Z) + δ.

An (ε, 0)-differentially private mechanism is also simply said to be ε-differentially private. Often, ε-differential privacy is referred to as pure differential privacy, whereas (ε, δ)-differential privacy is referred to as approximate differential privacy.

A privacy-preserving mechanism provides perfect privacy if it yields indistinguishable outputs for all neighbouring input datasets. The privacy level ε quantifies the privacy guarantee provided by ε-differential privacy. For a given query, the smaller the value of ε, the qualitatively higher the privacy. A randomised algorithm that is ε-differentially private is also ε'-differentially private for any ε' > ε.
In order to satisfy ε-differential privacy, the parameters of a privacy-preserving mechanism require a calculated calibration. The amount of noise required to achieve a specified privacy level depends on the query. If the output of the query does not change drastically for two neighbouring datasets, then a small amount of noise is required to achieve a given privacy level. The measure of such fluctuations is called the sensitivity of the query. The parameters of a privacy-preserving mechanism are calibrated using the sensitivity of the query, which quantifies the smoothness of a numeric query.

Definition 2 (Sensitivity). The sensitivity of a query f : D → R^k is defined as
∆f ≜ max_{x,y∈D, x∼y} ||f(x) − f(y)||_1.

The Laplace mechanism is a privacy-preserving mechanism that adds scaled noise sampled from a calibrated Laplace distribution to the numeric query.

Definition 3 ([35]). The Laplace distribution with mean zero and scale b > 0 is a probability distribution with probability density function
Lap(b) ≜ (1 / 2b) exp(−|x| / b),
where x ∈ R. We write Lap(b) to denote a random variable X ∼ Lap(b).

Definition 4 (Laplace Mechanism [10]). Given any function f : D → R^k and any x ∈ D, the Laplace mechanism is defined as
L_ε^{∆f}(x) ≜ M(f, ∆f/ε)(x) = f(x) + (L_1, ..., L_k),
where L_i is drawn from Lap(∆f/ε) and added to the i-th component of f(x).

Theorem 1 ([10]). The Laplace mechanism, L_{ε0}^{∆f}, is ε0-differentially private.
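For concreteness, a minimal Python sketch of the Laplace mechanism of Definition 4 is given below. It assumes NumPy; the function name and the toy count query are our own illustrative choices, not part of the original text.

```python
import numpy as np

def laplace_mechanism(f, x, sensitivity, epsilon, rng=None):
    """Release f(x) with Laplace noise of scale sensitivity / epsilon (Definition 4)."""
    rng = np.random.default_rng() if rng is None else rng
    true_value = np.atleast_1d(np.asarray(f(x), dtype=float))
    scale = sensitivity / epsilon                    # b = Delta_f / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=true_value.shape)
    return true_value + noise

# Example: a count query has sensitivity 1; here we use privacy level epsilon = 0.5.
dataset = [0, 1, 1, 0, 1]
noisy_count = laplace_mechanism(lambda d: sum(d), dataset, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```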

3 Privacy at Risk: A Probabilistic Quantification of Randomness

The parameters of a privacy-preserving mechanism are calibrated using the privacy level and the sensitivity of the query. A data steward needs to choose an appropriate privacy level for practical implementation. Lee et al. [25] show that the choice of an actual privacy level by a data steward in regard to her business requirements is a non-trivial task. Recall that the privacy level in the definition of differential privacy corresponds to the worst-case privacy loss. Business users are, however, used to taking and managing risks, if the risks can be quantified. For instance, Jorion [21] defines Value at Risk, which is used by risk analysts to quantify the loss in investments for a given portfolio and an acceptable confidence bound. Motivated by the formulation of Value at Risk, we propose the use of a probabilistic privacy level. It provides us with a finer tuning of an ε0-differentially private privacy-preserving mechanism for a specified risk γ.

Definition 5 (Privacy at Risk). For a given data-generating distribution G, a privacy-preserving mechanism M, equipped with a query f and with parameters Θ, satisfies ε-differential privacy with a privacy at risk 0 ≤ γ ≤ 1 if, for all Z ⊆ Range(M) and x, y sampled from G such that x ∼ y:
P[ ln( P(M(f, Θ)(x) ∈ Z) / P(M(f, Θ)(y) ∈ Z) ) > ε ] ≤ γ,   (1)
where the outer probability is calculated with respect to the probability space Range(M ◦ G) obtained by applying the privacy-preserving mechanism M on the data-generation distribution G.

If a privacy-preserving mechanism is ε0-differentially private for a given query f and parameters Θ, then for any privacy level ε ≥ ε0 the privacy at risk is 0. We are interested in quantifying the risk γ with which an ε0-differentially private privacy-preserving mechanism also satisfies a stronger ε-differential privacy, i.e., with ε < ε0.

Unifying Probabilistic and Random DP

Interestingly, Equation (1) unifies the notions of probabilistic differential privacy and random differential privacy by accounting for both sources of randomness in a privacy-preserving mechanism. Machanavajjhala et al. [27] define probabilistic differential privacy, which incorporates the explicit randomness of the noise distribution of the privacy-preserving mechanism, whereas Hall et al. [16] define random differential privacy, which incorporates the implicit randomness of the data-generation distribution. In probabilistic differential privacy, the outer probability is computed over the sample space of Range(M) and all datasets are equally probable.

Connection with Approximate DP

Despite a resemblance with probabilistic relaxations of differential privacy [13, 16, 27] due to the added parameter δ, (ε, δ)-differential privacy (Definition 1) is a non-probabilistic variant [29] of regular ε-differential privacy. Indeed, unlike the auxiliary parameters in probabilistic relaxations, such as γ in privacy at risk (cf. Definition 5), the parameter δ of approximate differential privacy is an absolute slack that is independent of the sources of randomness. For a specified choice of ε and δ, one can analytically compute a matching value of δ for a new value of ε'.² Therefore, like other probabilistic relaxations, privacy at risk cannot be directly related to approximate differential privacy. An alternative is to find a privacy at risk level γ for a given privacy level (ε, δ) while the original noise satisfies (ε0, δ).

Theorem 2. If a privacy-preserving mechanism satisfies (ε, γ)-privacy at risk, it also satisfies (ε, γ)-approximate differential privacy.

We obtain this reduction as the probability measure induced by the privacy-preserving mechanism and the data-generating distribution on any output set Z ⊆ Range(M) is additive.³ The proof of the theorem is in Appendix A.

² For any 0 < ε' ≤ ε, any (ε, δ)-differentially private mechanism also satisfies (ε', (e^ε − e^{ε'} + δ))-differential privacy.
³ The converse is not true, as explained before.

3.1 Composition theorem

The application of ε-differential privacy to many real-world problems suffers from the degradation of the privacy guarantee, i.e., the privacy level, over the composition. The basic composition theorem [12] dictates that the privacy guarantee degrades linearly in the number of evaluations of the mechanism. The advanced composition theorem [12] provides a finer analysis of the privacy loss over multiple evaluations, with a square-root dependence on the number of evaluations. In this section, we provide the composition theorem for privacy at risk.

Definition 6 (Privacy loss random variable). For a privacy-preserving mechanism M : D → R, any two neighbouring datasets x, y ∈ D and an output r ∈ R, the value of the privacy loss random variable C is defined as:
C(r) ≜ ln( P(M(x) = r) / P(M(y) = r) ).

Lemma 1. If a privacy-preserving mechanism M satisfies ε0-differential privacy, then
P[|C| ≤ ε0] = 1.

Theorem 3. For all ε0, ε, γ, δ > 0, the class of ε0-differentially private mechanisms, which satisfy (ε, γ)-privacy at risk under a uniform data-generation distribution, are (ε', δ)-differentially private under n-fold composition, where
ε' = ε0 √(2n ln(1/δ)) + nµ,
with µ = (1/2)[γε² + (1 − γ)ε0²].

Proof. Let M^{1...n} : D → R_1 × R_2 × ... × R_n denote the n-fold composition of privacy-preserving mechanisms {M^i : D → R_i}_{i=1}^n. Each ε0-differentially private M^i also satisfies (ε, γ)-privacy at risk for some ε ≤ ε0 and appropriately computed γ. Consider any two neighbouring datasets x, y ∈ D. Let
B ≜ { (r_1, ..., r_n) : ∏_{i=1}^n [ P(M^i(x) = r_i) / P(M^i(y) = r_i) ] > e^{ε'} }.
Using the technique in [12, Theorem 3.20], it suffices to show that P(M^{1...n}(x) ∈ B) ≤ δ. Consider
ln[ P(M^{1...n}(x) = (r_1, ..., r_n)) / P(M^{1...n}(y) = (r_1, ..., r_n)) ]
= ln ∏_{i=1}^n P(M^i(x) = r_i) / P(M^i(y) = r_i)
= Σ_{i=1}^n ln[ P(M^i(x) = r_i) / P(M^i(y) = r_i) ] ≜ Σ_{i=1}^n C^i,   (2)
where C^i in the last line denotes the privacy loss random variable related to M^i.
Consider an ε-differentially private mechanism M_ε and an ε0-differentially private mechanism M_{ε0}. Let M_{ε0} satisfy (ε, γ)-privacy at risk for ε ≤ ε0 and appropriately computed γ. Each M^i can be simulated as the mechanism M_ε with probability γ and the mechanism M_{ε0} otherwise. Therefore, the privacy loss random variable for each mechanism M^i can be written as
C^i = γ C^i_ε + (1 − γ) C^i_{ε0},
where C^i_ε denotes the privacy loss random variable associated with the mechanism M_ε and C^i_{ε0} denotes the privacy loss random variable associated with the mechanism M_{ε0}. Using [5, Remark 3.4], we can bound the mean of every privacy loss random variable as:
µ ≜ E[C^i] ≤ (1/2)[γε² + (1 − γ)ε0²].
We have a collection of n independent privacy loss random variables C^i such that P[|C^i| ≤ ε0] = 1. Using Hoeffding's bound [18] on the sample mean, for any β > 0,
P[ (1/n) Σ_i C^i ≥ E[C^i] + β ] ≤ exp( −nβ² / (2ε0²) ).
Rearranging the inequality by renaming the upper bound on the probability as δ, we get:
P[ Σ_i C^i ≥ nµ + ε0 √(2n ln(1/δ)) ] ≤ δ.
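The renaming step in the last line of the proof can be made explicit as follows; this is our reconstruction of the omitted algebra.

```latex
% Set the Hoeffding tail equal to \delta and solve for \beta:
\exp\!\Bigl(-\frac{n\beta^{2}}{2\varepsilon_0^{2}}\Bigr) = \delta
\;\Longrightarrow\;
\beta = \varepsilon_0\sqrt{\tfrac{2}{n}\ln\tfrac{1}{\delta}}
\;\Longrightarrow\;
\Pr\Bigl[\textstyle\sum_{i=1}^{n} C^{i} \ge n\mu + \varepsilon_0\sqrt{2n\ln\tfrac{1}{\delta}}\Bigr] \le \delta .
```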

Theorem 3 is an analogue, in the privacy at risk setting, of the advanced composition of differential privacy [12, Theorem 3.20] under a constraint of independent evaluations. Note that if one takes γ = 0, then we obtain the exact same formula as in [12, Theorem 3.20]. It provides a sanity check for the consistency of composition using privacy at risk.

Corollary 1 (Heterogeneous Composition). For all ε_l, ε, γ_l, δ > 0 and l ∈ {1, ..., n}, the composition of {ε_l}_{l=1}^n-differentially private mechanisms, which satisfy (ε, γ_l)-privacy at risk under a uniform data-generation distribution, also satisfies (ε', δ)-differential privacy, where
ε' = √( 2 (Σ_{l=1}^n ε_l²) ln(1/δ) ) + µ,
with µ = (1/2)[ ε² Σ_{l=1}^n γ_l + Σ_{l=1}^n (1 − γ_l) ε_l² ].

Proof. The proof follows from the same argument as that of Theorem 3: bounding the loss random variable at step l using γ_l C_ε + (1 − γ_l) C_{ε_l} and then applying the concentration inequality.

A detailed discussion and analysis of proving such heterogeneous composition theorems is available in [22, Section 3.3].
In fact, if we consider both sources of randomness, the expected value of the loss function must be computed by using the law of total expectation:
E[C] = E_{x,y∼G}[ E[C | x, y] ].
Therefore, the exact computation of privacy guarantees after the composition requires access to the data-generation distribution. We assume a uniform data-generation distribution while proving Theorem 3. We can obtain better and finer privacy guarantees by accounting for the data-generation distribution, which we keep as future work.

3.2 Convexity and Post-Processing

We show that privacy at risk satisfies the convexity property and does not satisfy the post-processing property.

Lemma 2 (Convexity). For a given ε0-differentially private privacy-preserving mechanism, privacy at risk satisfies the convexity property.

Proof. Let M be a mechanism that satisfies ε0-differential privacy. By the definition of privacy at risk, it also satisfies (ε1, γ1)-privacy at risk as well as (ε2, γ2)-privacy at risk for some ε1, ε2 ≤ ε0 and appropriately computed values of γ1 and γ2. Let M1 and M2 denote the hypothetical mechanisms that satisfy (ε1, γ1)-privacy at risk and (ε2, γ2)-privacy at risk, respectively. We can write the privacy loss random variables as follows:
C^1 ≤ γ1 ε1 + (1 − γ1) ε0,
C^2 ≤ γ2 ε2 + (1 − γ2) ε0,
where C^1 and C^2 denote the privacy loss random variables for M1 and M2.
Let us consider a privacy-preserving mechanism M that uses M1 with a probability p and M2 with a probability (1 − p) for some p ∈ [0, 1]. By using the techniques in the proof of Theorem 3, the privacy loss random variable C for M can be written as:
C = p C^1 + (1 − p) C^2 ≤ γ' ε' + (1 − γ') ε0,
where
ε' = ( p γ1 ε1 + (1 − p) γ2 ε2 ) / ( p γ1 + (1 − p) γ2 )   and   γ' = p γ1 + (1 − p) γ2.
Thus, M satisfies (ε', γ')-privacy at risk. This proves that privacy at risk satisfies convexity [23, Axiom 2.1.2].

Meiser [29] proved that a relaxation of differential privacy that provides probabilistic bounds on the privacy loss random variable does not satisfy the post-processing property of differential privacy. Privacy at risk is indeed such a probabilistic relaxation.

Corollary 2 (Post-processing). Privacy at risk does not satisfy the post-processing property for every possible mapping of the output.

Though privacy at risk is not preserved after post-processing, it yields a weaker guarantee in terms of approximate differential privacy after post-processing. The proof involves a reduction of privacy at risk to approximate differential privacy and the preservation of approximate differential privacy under post-processing.

Lemma 3 (Weak Post-processing). Let M : D → R ⊆ R^k be a mechanism that satisfies (ε, γ)-privacy at risk and f : R → R' be any arbitrary data-independent mapping. Then, f ◦ M : D → R' satisfies (ε, γ)-approximate differential privacy.

Proof. Let us fix a pair of neighbouring datasets x and y, and an event Z' ⊆ R'. Let us define the pre-image of Z' as Z ≜ {r ∈ R : f(r) ∈ Z'}. Now, we get
P(f ◦ M(x) ∈ Z') = P(M(x) ∈ Z)
  ≤ e^ε P(M(y) ∈ Z) + γ   (a)
  = e^ε P(f ◦ M(y) ∈ Z') + γ,
where (a) is a direct consequence of Theorem 2.

4 Privacy at Risk for the Laplace Mechanism

The Laplace and Gaussian mechanisms are widely used privacy-preserving mechanisms in the literature. The Laplace mechanism satisfies pure ε-differential privacy, whereas the Gaussian mechanism satisfies approximate (ε, δ)-differential privacy. As previously discussed, it is not straightforward to establish a connection between the non-probabilistic parameter δ of approximate differential privacy and the probabilistic bound γ of privacy at risk. Therefore, we keep privacy at risk for the Gaussian mechanism as future work.
In this section, we instantiate privacy at risk for the Laplace mechanism in three cases: two cases involving the two sources of randomness and a third case involving their coupled effect. These three cases correspond to three different interpretations of the confidence level, represented by the parameter γ, corresponding to three interpretations of the support of the outer probability in Definition 5. In order to highlight this nuance, we denote the confidence levels corresponding to the three cases and their sources of randomness as γ1, γ2 and γ3, respectively.

4.1 The Case of Explicit Randomness

In this section, we study the effect of the explicit randomness induced by the noise sampled from the Laplace distribution. We provide a probabilistic quantification for fine-tuning the Laplace mechanism. We fine-tune the privacy level for a specified risk, assuming that the sensitivity of the query is known a priori.
For a Laplace mechanism L_{ε0}^{∆f} calibrated with sensitivity ∆f and privacy level ε0, we present the analytical formula relating the privacy level ε and the risk γ1 in Theorem 4. The proof is available in Appendix B.

Theorem 4. The risk γ1 ∈ [0, 1] with which a Laplace mechanism L_{ε0}^{∆f}, for a numeric query f : D → R^k, satisfies a privacy level ε ≥ 0 is given by
γ1 = P(T ≤ ε) / P(T ≤ ε0),   (3)
where T is a random variable that follows a distribution with the following density function:
P_T(t) = ( 2^{1−k} t^{k−1/2} K_{k−1/2}(t) ε0 ) / ( √(2π) Γ(k) ∆f ),
where K_{k−1/2} is the Bessel function of the second kind.
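Theorem 4 can be evaluated numerically; for k = 1 the ratio in Equation 3 reduces to γ1 = (1 − e^{−ε}) / (1 − e^{−ε0}), whose inverse is the closed form given as Equation 4 below. The following Python sketch, with our own function names and a purely illustrative loop, shows both directions of the computation.

```python
import math

def privacy_at_risk_gamma1(eps, eps0):
    """gamma_1 = P(T <= eps) / P(T <= eps0) for a real-valued query (k = 1)."""
    return (1.0 - math.exp(-eps)) / (1.0 - math.exp(-eps0))

def privacy_level_for_gamma1(gamma1, eps0):
    """Closed-form inverse (Equation 4): privacy level achieved with risk gamma_1 (k = 1)."""
    return math.log(1.0 / (1.0 - gamma1 * (1.0 - math.exp(-eps0))))

eps0 = 1.0
for gamma1 in (0.25, 0.5, 0.75, 1.0):
    # gamma_1 = 1 recovers eps = eps0, as expected.
    print(gamma1, privacy_level_for_gamma1(gamma1, eps0))
```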

Figure 1a shows the plot of the privacy level against the risk for different values of k for the Laplace mechanism L_{1.0}^{1.0}. As the value of k increases, the amount of noise added to the output of the numeric query increases. Therefore, for a specified privacy level, the privacy at risk level increases with the value of k.
The analytical formula representing γ1 as a function of ε is bijective. We need to invert it to obtain the privacy level ε for a privacy at risk γ1. However, the analytical closed form for such an inverse function is not explicit. We use a numerical approach to compute the privacy level for a given privacy at risk from the analytical formula of Theorem 4.
Result for a Real-valued Query. For the case k = 1, the analytical derivation is fairly straightforward. In this case, we obtain an invertible closed form of the privacy level for a specified risk. It is presented in Equation 4:
ε = ln( 1 / (1 − γ1 (1 − e^{−ε0})) ).   (4)
Remarks on ε0. For k = 1, Figure 1b shows the plot of the privacy level ε versus the privacy at risk γ1 for the Laplace mechanism L_{ε0}^{1.0}. As the value of ε0 increases, the probability of the Laplace mechanism generating a higher value of noise reduces. Therefore, for a fixed privacy level, the privacy at risk increases with the value of ε0. The same observation is made for k > 1.

4.2 The Case of Implicit Randomness

In this section, we study the effect of the implicit randomness induced by the data-generation distribution to provide a fine-tuning for the Laplace mechanism. We fine-tune the risk for a specified privacy level without assuming that the sensitivity of the query is known.
If one takes into account the randomness induced by the data-generation distribution, all pairs of neighbouring datasets are not equally probable. This leads to the estimation of the sensitivity of a query for a specified data-generation distribution. If we have access to an analytical form of the data-generation distribution and to the query, we could analytically derive the sensitivity distribution for the query. In general, we have access to the datasets, but not to the data-generation distribution that generates them. We therefore statistically estimate the sensitivity by constructing an empirical distribution. We call the sensitivity value obtained for a specified risk from the empirical cumulative distribution of sensitivity the sampled sensitivity (Definition 7). However, the value of the sampled sensitivity is simply an estimate of the sensitivity for a specified risk. In order to capture this additional uncertainty, introduced by the estimation from the empirical sensitivity distribution rather than the true unknown distribution, we compute a lower bound on the accuracy of this estimation. This lower bound yields a probabilistic lower bound on the specified risk. We refer to it as the empirical risk. For a specified absolute risk γ2, we denote by γ̂2 the corresponding empirical risk.
For the Laplace mechanism L_ε^{∆Sf} calibrated with sampled sensitivity ∆Sf and privacy level ε, we evaluate the empirical risk γ̂2. We present the result in Theorem 5. The proof is available in Appendix C.

Theorem 5. The analytical bound on the empirical risk γ̂2 for the Laplace mechanism L_ε^{∆Sf} with privacy level ε and sampled sensitivity ∆Sf for a query f : D → R^k is
γ̂2 ≥ γ2 (1 − 2 e^{−2ρ²n}),   (5)
where n is the number of samples used for estimation of the sampled sensitivity and ρ is the accuracy parameter. γ2 denotes the specified absolute risk.

The error parameter ρ controls the closeness between the empirical cumulative distribution of the sensitivity and the true cumulative distribution of the sensitivity. The lower the value of the error, the closer the empirical cumulative distribution is to the true cumulative distribution. Mathematically,
ρ ≥ sup_∆ | F_S^n(∆) − F_S(∆) |,
where F_S^n is the empirical cumulative distribution of the sensitivity after n samples and F_S is the actual cumulative distribution of the sensitivity.
Figure 2 shows the plot of the number of samples as a function of the privacy at risk and the error parameter. Naturally, we require a higher number of samples in order to have a lower error rate. The number of samples reduces as the privacy at risk increases. A lower risk demands precision in the estimated sampled sensitivity, which in turn requires a larger number of samples.
If the analytical form of the data-generation distribution is not known a priori, the empirical distribution of sensitivity can be estimated in two ways. The first way is to fit a known distribution to the available data and later use it to build an empirical distribution of the sensitivities. The second way is to sub-sample from a large dataset in order to build an empirical distribution of the sensitivities. In both of these ways, the empirical distribution of sensitivities captures the inherent randomness in the data-generation distribution. The first way suffers from the goodness of fit of the known distribution to the available data. An ill-fitting distribution does not reflect the true data-generation distribution and hence introduces errors in the sensitivity estimation. Since the second way involves subsampling, it is immune to this problem. The quality of sensitivity estimates obtained by sub-sampling the datasets depends on the availability of a large population.

Fig. 1. Privacy level ε for varying privacy at risk γ1 for the Laplace mechanism L_{ε0}^{1.0}. In Figure 1a, we use ε0 = 1.0 and different values of k. In Figure 1b, k = 1 and different values of ε0.

Fig. 2. Number of samples n for varying privacy at risk γ2 for different values of the error parameter ρ.

Let G denote the data-generation distribution, either known a priori or constructed by subsampling the available data. We adopt the procedure of [38] to sample two neighbouring datasets with p data points each. We sample p − 1 data points from G that are common to both of these datasets and later two more data points, independently. From those two points, we allot one data point to each of the two datasets.
Let S_f = ||f(x) − f(y)||_1 denote the sensitivity random variable for a given query f, where x and y are two neighbouring datasets sampled from G. Using n pairs of neighbouring datasets sampled from G, we construct the empirical cumulative distribution, F_n, for the sensitivity random variable.

Definition 7. For a given query f and for a specified risk γ2, the sampled sensitivity, ∆Sf, is defined as the value of the sensitivity random variable that is estimated using its empirical cumulative distribution function, F_n, constructed using n pairs of neighbouring datasets sampled from the data-generation distribution G:
∆Sf ≜ F_n^{-1}(γ2).

If we knew the analytical form of the data-generation distribution, we could analytically derive the cumulative distribution function of the sensitivity, F, and find the sensitivity of the query as ∆f = F^{-1}(1). Therefore, in order to have the sampled sensitivity close to the sensitivity of the query, we require the empirical cumulative distribution to be close to the cumulative distribution of the sensitivity. We use this insight to derive the analytical bound in Theorem 5.

4.3 The Case of Explicit and Implicit Randomness

In this section, we study the combined effect of the explicit randomness induced by the noise distribution and the implicit randomness in the data-generation distribution. We do not assume knowledge of the sensitivity of the query.
We estimate the sensitivity using the empirical cumulative distribution of the sensitivity. We construct the empirical distribution over the sensitivities using the sampling technique presented in the earlier case. Since we use the sampled sensitivity (Definition 7) to calibrate the Laplace mechanism, we estimate the empirical risk γ̂3.
For the Laplace mechanism L_{ε0}^{∆Sf} calibrated with sampled sensitivity ∆Sf and privacy level ε0, we present the analytical bound on the empirical risk γ̂3 in Theorem 6, with proof in Appendix D.

Theorem 6. The analytical bound on the empirical risk γ̂3 ∈ [0, 1] to achieve a privacy level ε > 0 for the Laplace mechanism L_{ε0}^{∆Sf} with sampled sensitivity ∆Sf of a query f : D → R^k is
γ̂3 ≥ γ3 (1 − 2 e^{−2ρ²n}),   (6)
where n is the number of samples used for estimating the sensitivity and ρ is the accuracy parameter. γ3 denotes the specified absolute risk, defined as
γ3 = [ P(T ≤ ε) / P(T ≤ ηε0) ] · γ2.
Here, η is of the order of the ratio of the true sensitivity of the query to its sampled sensitivity.
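The estimation pipeline of this and the previous subsection can be sketched as follows, assuming NumPy. The helper names, the mean query and the uniform data-generation distribution are illustrative assumptions of ours, not the authors' experimental setup.

```python
import numpy as np

def sample_neighbouring_pair(generator, p, rng):
    """Two neighbouring datasets of size p sharing p-1 points (procedure of [38], as described above)."""
    common = [generator(rng) for _ in range(p - 1)]
    return common + [generator(rng)], common + [generator(rng)]

def sampled_sensitivity(query, generator, p, n, gamma2, rng=None):
    """Empirical CDF F_n of the sensitivity random variable S_f and its gamma_2-quantile (Definition 7)."""
    rng = np.random.default_rng() if rng is None else rng
    sensitivities = []
    for _ in range(n):
        x, y = sample_neighbouring_pair(generator, p, rng)
        sensitivities.append(np.sum(np.abs(np.asarray(query(x)) - np.asarray(query(y)))))
    return float(np.quantile(sensitivities, gamma2))

def empirical_risk_lower_bound(gamma, rho, n):
    """gamma_hat >= gamma * (1 - 2 * exp(-2 * rho**2 * n)), the bound of Theorems 5 and 6."""
    return gamma * (1.0 - 2.0 * np.exp(-2.0 * rho**2 * n))

# Example: mean query over scalar data drawn uniformly from [0, 1] (illustrative assumption).
delta_s = sampled_sensitivity(lambda d: np.mean(d), lambda r: r.uniform(0, 1),
                              p=100, n=2000, gamma2=0.9)
print(delta_s, empirical_risk_lower_bound(0.9, rho=0.01, n=2000))
```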

Fig. 3. Dependence of the error and the number of samples on the privacy at risk γ3 for the Laplace mechanism L_{1.0}^{∆Sf}. In Figure 3a, we fix the number of samples to 10000. In Figure 3b, we fix the error parameter to 0.01.

The error parameter ρ controls the closeness between the empirical cumulative distribution of the sensitivity and the true cumulative distribution of the sensitivity. Figure 3 shows the dependence of the error parameter on the number of samples. In Figure 3a, we observe that for a fixed number of samples and a fixed privacy level, the privacy at risk decreases with the value of the error parameter. For a fixed number of samples, smaller values of the error parameter reduce the probability of similarity between the empirical cumulative distribution of the sensitivity and the true cumulative distribution. Therefore, we observe the reduction in the risk for a fixed privacy level. In Figure 3b, we observe that for a fixed value of the error parameter and a fixed privacy level, the risk increases with the number of samples. For a fixed value of the error parameter, larger values of the sample size increase the probability of similarity between the empirical cumulative distribution of the sensitivity and the true cumulative distribution. Therefore, we observe the increase in the risk for a fixed privacy level.
The effect of considering both implicit and explicit randomness is evident in the analytical expression for γ3 in Equation 7; the proof is available in Appendix D. The privacy at risk is composed of two factors: the second factor is a privacy at risk that accounts for the inherent randomness of the data-generation distribution, whereas the first factor takes into account the explicit randomness of the Laplace distribution along with a coupling coefficient η. We define η as the ratio of the true sensitivity of the query to its sampled sensitivity. We provide an approximation to estimate η in the absence of knowledge of the true sensitivity; it can be found in Appendix D.
γ3 ≜ [ P(T ≤ ε) / P(T ≤ ηε0) ] · γ2   (7)
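For a real-valued query (k = 1), where P(T ≤ t) is proportional to 1 − e^{−t} (cf. Equation 4), Equation 7 reduces to a one-line computation. This specialisation is our own and only covers the k = 1 case.

```python
import math

def gamma3_real_valued(eps, eps0, eta, gamma2):
    """Equation 7 for k = 1, using P(T <= t) proportional to 1 - exp(-t)."""
    return gamma2 * (1.0 - math.exp(-eps)) / (1.0 - math.exp(-eta * eps0))
```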

5 Minimising Compensation Budget for Privacy at Risk

Many service providers collect users' data to enhance the user experience. In order to avoid misuse of this data, we require a legal framework that not only limits the use of the collected data but also proposes reparative measures in case of a data leak. The General Data Protection Regulation (GDPR)⁴ is such a legal framework.
Article 82 of the GDPR states that any person who suffers material or non-material damage as a result of a personal data breach has the right to demand compensation from the data processor. Therefore, every GDPR-compliant business entity that either holds or processes personal data needs to secure a certain budget for the scenario of a personal data breach. In order to reduce the risk of such an unfortunate event, the business entity may use privacy-preserving mechanisms that provide provable privacy guarantees while publishing its results. In order to calculate the compensation budget for a business entity, we devise a cost model that maps the privacy guarantees provided by differential privacy and privacy at risk to monetary costs. The discussions demonstrate the usefulness of the probabilistic quantification of differential privacy in a business setting.

⁴ https://gdpr-info.eu/

5.1 Cost Model for Differential Privacy

Let E be the compensation budget that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed without any provable privacy guarantee. Let E_ε^dp be the compensation budget that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed with privacy guarantees in terms of ε-differential privacy.
The privacy level ε in ε-differential privacy quantifies the indistinguishability of the outputs of a privacy-preserving mechanism when two neighbouring datasets are provided as inputs. When the privacy level is zero, the privacy-preserving mechanism outputs all results with equal probability. The indistinguishability reduces with an increase in the privacy level. Thus, a privacy level of zero bears the lowest risk of a personal data breach, and the risk increases with the privacy level. E_ε^dp needs to be commensurate with such a risk and, therefore, needs to satisfy the following constraints.
1. For all ε ∈ R^{≥0}, E_ε^dp ≤ E.
2. E_ε^dp is a monotonically increasing function of ε.
3. As ε → 0, E_ε^dp → E_min, where E_min is the unavoidable cost that the business entity might need to pay in case of a personal data breach even after the privacy measures are employed.
4. As ε → ∞, E_ε^dp → E.
There are various functions that satisfy these constraints. In the absence of any further constraints, we model E_ε^dp as defined in Equation (8):
E_ε^dp ≜ E_min + E e^{−c/ε}.   (8)
E_ε^dp has two parameters, namely c > 0 and E_min ≥ 0. c controls the rate of change in the cost as the privacy level changes, and E_min is a privacy-level-independent bias. For this study, we use a simplified model with c = 1 and E_min = 0.

5.2 Cost Model for Privacy at Risk

Let E^par_{ε0}(ε, γ) be the compensation that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed with an ε0-differentially private privacy-preserving mechanism along with a probabilistic quantification of the privacy level. Such a quantification allows us to provide a stronger privacy guarantee, viz. ε < ε0, for a specified privacy at risk of at most γ. Thus, we calculate E^par_{ε0} using Equation 9:
E^par_{ε0}(ε, γ) ≜ γ E_ε^dp + (1 − γ) E_{ε0}^dp.   (9)
Note that the analysis in this section is specific to the cost model in Equation 8. It naturally extends to any choice of convex cost model.

5.2.1 Existence of Minimum Compensation Budget

We want to find the privacy level, say ε_min, that yields the lowest compensation budget. We do that by minimising Equation 9 with respect to ε.

Lemma 4. For the choice of cost model in Equation 8, E^par_{ε0}(ε, γ) is a convex function of ε.

By Lemma 4, there exists a unique ε_min that minimises the compensation budget for a specified parametrisation, say ε0. Since the risk γ in Equation 9 is itself a function of the privacy level ε, analytical calculation of ε_min is not possible in the most general case. When the output of the query is a real number, i.e. k = 1, we derive the analytic form (Equation 4) to compute the risk under the consideration of explicit randomness. In such a case, ε_min is calculated by differentiating Equation 9 with respect to ε and equating it to zero. This gives Equation 10, which we solve using any root-finding technique, such as the Newton-Raphson method [37], to compute ε_min.
(1/ε) − ln( 1 − (1 − e^{ε}) / ε² ) = 1/ε0   (10)
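A short derivation of Equation 10, assuming the simplified model c = 1, E_min = 0 and the closed form γ(ε) = (1 − e^{−ε}) / (1 − e^{−ε0}) obtained by inverting Equation 4; this reconstructs the step the text leaves implicit.

```latex
% With c = 1, E_min = 0 and \gamma(\varepsilon) = \frac{1-e^{-\varepsilon}}{1-e^{-\varepsilon_0}}:
E^{par}_{\varepsilon_0}(\varepsilon)
  = \frac{E\,(1-e^{-\varepsilon})}{1-e^{-\varepsilon_0}}\, e^{-1/\varepsilon}
  + \Bigl(1 - \frac{1-e^{-\varepsilon}}{1-e^{-\varepsilon_0}}\Bigr) E\, e^{-1/\varepsilon_0}.
% Setting dE^{par}_{\varepsilon_0}/d\varepsilon = 0 and dividing by E e^{-1/\varepsilon}/(1-e^{-\varepsilon_0}) gives
e^{-\varepsilon}\bigl(1 - e^{1/\varepsilon - 1/\varepsilon_0}\bigr)
  + \frac{1-e^{-\varepsilon}}{\varepsilon^{2}} = 0
\;\Longleftrightarrow\;
\frac{1}{\varepsilon} - \ln\Bigl(1 - \frac{1-e^{\varepsilon}}{\varepsilon^{2}}\Bigr)
  = \frac{1}{\varepsilon_0},
% which is Equation 10.
```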

5.2.2 Fine-tuning Privacy at Risk

For a fixed budget, say B, re-arrangement of Equation 9 gives us an upper bound on the privacy level ε. We use the cost model with c = 1 and E_min = 0 to derive the upper bound. If we have a maximum permissible expected mean absolute error T, we use Equation 12 to obtain a lower bound on the privacy at risk level. Equation 11 illustrates the upper and lower bounds that dictate the permissible range of ε that a data publisher can promise depending on the budget and the permissible error constraints.
1/T ≤ ε ≤ [ ln( γE / (B − (1 − γ) E^dp_{ε0}) ) ]^{-1}   (11)
Thus, the privacy level is constrained by the effectiveness requirement from below and by the monetary budget from above. Hsu et al. [19] calculate upper and lower bounds on the privacy level in differential privacy. They use a different cost model owing to their scenario of a research study that compensates its participants for their data and releases the results in a differentially private manner. Their cost model is different from our GDPR-inspired modelling.
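A minimal sketch of the bounds in Equation 11, again assuming c = 1 and E_min = 0; the budget B and the error bound T are expressed on the same scale as E, and the function name is ours.

```python
import math

def permissible_privacy_range(B, T, E, eps0, gamma):
    """Equation 11: effectiveness gives the lower bound, the budget B the upper bound.

    Requires B > (1 - gamma) * E * exp(-1 / eps0) and the logarithm to be positive,
    otherwise no privacy level satisfies both constraints.
    """
    eps_lower = 1.0 / T
    E_dp_eps0 = E * math.exp(-1.0 / eps0)
    eps_upper = 1.0 / math.log(gamma * E / (B - (1.0 - gamma) * E_dp_eps0))
    return eps_lower, eps_upper
```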

Fig. 4. Variation in the budget for the Laplace mechanism L^1_{ε0} under privacy at risk, considering explicit randomness in the Laplace mechanism, for the illustration in Section 5.3.

5.3 Illustration

Suppose that the health centre in a university that complies with GDPR publishes statistics of its staff health checkup, such as obesity statistics, twice a year. In January 2018, the health centre publishes that 34 out of 99 faculty members suffer from obesity. In July 2018, the health centre publishes that 35 out of 100 faculty members suffer from obesity. An intruder, perhaps an analyst working for an insurance company, checks the staff listings in January 2018 and July 2018, which are publicly available on the website of the university. The intruder does not find any change other than the recruitment of John Doe in April 2018. Thus, with high probability, the intruder deduces that John Doe suffers from obesity. In order to avoid such a privacy breach, the health centre decides to publish the results using the Laplace mechanism. In this case, the Laplace mechanism operates on the count query.
In order to control the amount of noise, the health centre needs to appropriately set the privacy level. Suppose that the health centre decides to use the expected mean absolute error, defined in Equation 12, as the measure of effectiveness for the Laplace mechanism:
E |L^1_ε(x) − f(x)| = 1/ε.   (12)
Equation 12 makes use of the fact that the sensitivity of the count query is one. Suppose that the health centre requires an expected mean absolute error of at most two in order to maintain the quality of the published statistics. In this case, the privacy level has to be at least 0.5.
In order to compute the budget, the health centre requires an estimate of E. Moriarty et al. [30] show that the incremental cost of premiums for health insurance with morbid obesity ranges between $5467 and $5530. With reference to this research, the health centre takes $5500 as an estimate of E. For a staff size of 100 and the privacy level 0.5, the health centre uses Equation 8 in its simplified setting to compute a total budget of $74434.40.
Is it possible to reduce this budget without degrading the effectiveness of the Laplace mechanism? We show that it is possible by fine-tuning the Laplace mechanism. Under the consideration of the explicit randomness introduced by the Laplace noise distribution, we show that an ε0-differentially private Laplace mechanism also satisfies ε-differential privacy with risk γ, which is computed using the formula in Theorem 4. Fine-tuning allows us to get a stronger privacy guarantee, ε < ε0, that requires a smaller budget. In Figure 4, we plot the budget for various privacy levels. We observe that the privacy level 0.274, which is the same as ε_min computed by solving Equation 10, yields the lowest compensation budget of $37805.86. Thus, by using privacy at risk, the health centre is able to save $36628.532 without sacrificing the quality of the published results.
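The numbers of this illustration can be checked with a short script. The use of SciPy's brentq and the bracketing interval are our choices, and the printed values should be close to, though not byte-identical with, the figures quoted above.

```python
import math
from scipy.optimize import brentq

E_total = 100 * 5500.0   # staff of 100, E = $5500 per stakeholder (Section 5.3)
eps0 = 0.5               # expected mean absolute error of at most 2 => eps >= 0.5

budget_dp = lambda eps: E_total * math.exp(-1.0 / eps)                # Equation 8, c = 1, E_min = 0
gamma1 = lambda eps: (1 - math.exp(-eps)) / (1 - math.exp(-eps0))     # Equation 4 inverted (k = 1)
budget_par = lambda eps: gamma1(eps) * budget_dp(eps) + (1 - gamma1(eps)) * budget_dp(eps0)  # Eq. 9

# Equation 10: its root is eps_min; brentq needs a sign-changing bracket.
eq10 = lambda eps: 1.0 / eps - math.log(1.0 - (1.0 - math.exp(eps)) / eps**2) - 1.0 / eps0
eps_min = brentq(eq10, 0.05, eps0)

print(budget_dp(eps0))                 # ~ $74,434 without privacy at risk
print(eps_min, budget_par(eps_min))    # eps_min ~ 0.274, budget close to the ~$37.8k reported above
```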
5.4 Cost Model and the Composition of Laplace Mechanisms

Convexity of the proposed cost function enables us to estimate the optimal value of the privacy at risk level. We use the optimal privacy value to provide tighter bounds on the composition of the Laplace mechanism. In Figure 5, we compare the privacy guarantees obtained by using the basic composition theorem [12], the advanced composition theorem [12] and the composition theorem for privacy at risk. We comparatively evaluate them for the composition of Laplace mechanisms with privacy levels 0.1, 0.5 and 1.0. We compute the privacy level after composition by setting δ to 10^{-5}.
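A sketch of the three composed privacy levels compared in Figure 5, using the standard basic and advanced composition formulas [12] and Theorem 3 as reconstructed above. The (ε, γ) pairs are the cost-optimal ones reported in the subcaptions of Figure 5; the function names are ours.

```python
import math

def basic_composition(eps0, n):
    return n * eps0

def advanced_composition(eps0, n, delta):
    """Advanced composition [12, Theorem 3.20]."""
    return eps0 * math.sqrt(2 * n * math.log(1 / delta)) + n * eps0 * (math.exp(eps0) - 1)

def at_risk_composition(eps0, eps, gamma, n, delta):
    """Theorem 3, with (eps, gamma) the cost-optimal privacy at risk of the eps0-mechanism."""
    mu = 0.5 * (gamma * eps**2 + (1 - gamma) * eps0**2)
    return eps0 * math.sqrt(2 * n * math.log(1 / delta)) + n * mu

delta, n = 1e-5, 100
for eps0, (eps, gamma) in [(0.1, (0.08, 0.80)), (0.5, (0.27, 0.61)), (1.0, (0.42, 0.54))]:
    print(eps0,
          basic_composition(eps0, n),
          advanced_composition(eps0, n, delta),
          at_risk_composition(eps0, eps, gamma, n, delta))
```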

(a) L^1_{0.1} satisfies (0.08, 0.80)-privacy at risk. (b) L^1_{0.5} satisfies (0.27, 0.61)-privacy at risk. (c) L^1_{1.0} satisfies (0.42, 0.54)-privacy at risk.

Fig. 5. Comparing the privacy guarantee obtained by basic composition and advanced composition [12] with the composition obtained using the optimal privacy at risk that minimises the cost of the Laplace mechanism L^1_{ε0}. For the evaluation, we set δ = 10^{-5}.

We observe that the use of the optimal privacy at risk provides significantly stronger privacy guarantees as compared to the conventional composition theorems. The advanced composition theorem is known to provide stronger privacy guarantees for mechanisms with smaller εs. As we observe in Figure 5b and Figure 5c, our composition provides strictly stronger privacy guarantees than basic composition in the cases where the advanced composition fails.

Comparison with the Moment Accountant

Papernot et al. [33, 34] empirically showed that the privacy guarantees provided by the advanced composition theorem are quantitatively worse than the ones achieved by the state-of-the-art moment accountant [1]. The moment accountant evaluates the privacy guarantee by keeping track of various moments of privacy loss random variables. The computation of the moments is performed by using numerical methods on the specified dataset. Therefore, despite the quantitative strength of the privacy guarantee provided by the moment accountant, it is qualitatively weaker, in the sense that it is specific to the dataset used for evaluation, in contrast to advanced composition.
Papernot et al. [33] introduced the PATE framework, which uses the Laplace mechanism to provide privacy guarantees for a machine learning model trained in an ensemble manner. We comparatively evaluate the privacy guarantees provided by their moment accountant on the MNIST dataset with the privacy guarantees obtained using privacy at risk. We do so by using privacy at risk while computing a data-dependent bound [33, Theorem 3]. Under the identical experimental setup, we use a 0.1-differentially private Laplace mechanism, which optimally satisfies (0.08, 0.8)-privacy at risk. We list the calculated privacy guarantees in Table 1. The reported privacy guarantee is the mean privacy guarantee over 30 experiments.

Privacy level for moment accountant(ε) δ #Queries with differential privacy [33] with privacy at risk

10−5 100 2.04 1.81 10−5 1000 8.03 5.95 Table 1. Comparative analysis of privacy levels computed using three composition theorems when applied to 0.1-differentially private Laplace mechanism, which optimally satisfies (0.08, 0.8)-privacy at risk. The observations for the moment accountant on MNIST datasets are taken from [33].

6 Balancing Utility and Privacy

In this section, we empirically illustrate and discuss the steps that a data steward needs to take and the issues that she needs to consider in order to realise a required privacy at risk level ε for a confidence level γ when seeking to disclose the result of a query.
We consider a query that returns the parameters of a ridge regression [31] for an input dataset. It is a basic and widely used statistical analysis tool. We use the privacy-preserving mechanism presented by Ligett et al. [26] for ridge regression. It is a Laplace mechanism that induces noise in the output parameters of the ridge regression. The authors provide a theoretical upper bound on the sensitivity of the ridge regression, which we refer to as the sensitivity in the experiments.

6.1 Dataset and Experimental Setup

We conduct experiments on a subset of the 2000 US census dataset provided by the Minnesota Population Center in its Integrated Public Use Microdata Series [39]. The census dataset consists of a 1% sample of the original census data. It spans 1.23 million households with records of 2.8 million people. The value of several attributes is not necessarily available for every household. We have therefore selected 212,605 records, corresponding to the household heads, and 6 attributes, namely Age, Gender, Race, Marital Status, Education and Income, whose values are available for all 212,605 records.
In order to satisfy the constraint in the derivation of the sensitivity of ridge regression [26], we, without loss of generality, normalise the dataset in the following way. We normalise the attribute Income such that its values lie in [0, 1]. We normalise the other attributes such that the l2 norm of each data point is unity.
All experiments are run on a Linux machine with a 12-core 3.60 GHz Intel® Core i7™ processor and 64 GB of memory. Python 2.7.6 is used as the scripting language.

6.2 Result Analysis

We train a ridge regression model to predict Income using the other attributes as predictors. We split the dataset into a training dataset (80%) and a testing dataset (20%). We compute the root mean squared error (RMSE) of the ridge regression, trained on the training data with the regularisation parameter set to 0.01, on the testing dataset. We use it as the metric of utility loss. The smaller the value of RMSE, the smaller the loss in utility. For a given value of the privacy at risk level, we run 50 repetitions of the differentially private ridge regression experiment and report the means over the 50 runs.
Let us now provide illustrative experiments under the three different cases. In every scenario, the data steward is given a privacy at risk level ε and the confidence level γ and wants to disclose the parameters of a ridge regression model that she trains on the census dataset. She needs to calibrate the Laplace mechanism by estimating either its privacy level ε0 (Case 1) or the sensitivity (Case 2) or both (Case 3) to achieve the privacy at risk required for the ridge regression query.

The Case of Explicit Randomness (cf. Section 4.1). In this scenario, the data steward knows the sensitivity for the ridge regression. She needs to compute the privacy level, ε0, to calibrate the Laplace mechanism. She uses Equation 3, which links the desired privacy at risk level ε, the confidence level γ1 and the privacy level of noise ε0. Specifically, for given ε and γ1, she computes ε0 by solving the equation:
γ1 P(T ≤ ε0) − P(T ≤ ε) = 0.
Since the equation does not give an analytical formula for ε0, the data steward uses a root-finding algorithm such as the Newton-Raphson method [37] to solve the above equation (a short numerical sketch is given at the end of this subsection). For instance, if she needs to achieve a privacy at risk level ε = 0.4 with confidence level γ1 = 0.6, she can substitute these values in the above equation and solve it to get the privacy level of noise ε0 = 0.8.
Figure 6 shows the variation of the privacy at risk level ε with the confidence level γ1. It also depicts the variation of the utility loss for different privacy at risk levels. In accordance with the data steward's problem, if she needs to achieve a privacy at risk level ε = 0.4 with confidence level γ1 = 0.6, she obtains the privacy level of noise to be ε0 = 0.8. Additionally, we observe that the choice of privacy level 0.8 instead of 0.4 to calibrate the Laplace mechanism gives a lower utility loss for the data steward. This is the benefit drawn from the risk taken under the control of privacy at risk. Thus, she uses the privacy level ε0 and the sensitivity of the function to calibrate the Laplace mechanism.

Fig. 6. Utility, measured by RMSE (right y-axis), and privacy at risk level ε for the Laplace mechanism (left y-axis) for varying confidence levels γ1.

The Case of Implicit Randomness (cf. Section 4.2). In this scenario, the data steward does not know the sensitivity of the ridge regression. She assesses that she can afford to sample at most n times from the population dataset. She understands the effect of the uncertainty introduced by the statistical estimation of the sensitivity. Therefore, she uses the confidence level for the empirical privacy at risk γ̂2.
Given the value of n, she chooses the value of the accuracy parameter using Figure 2. For instance, if the number of samples that she can draw is 10^4, she chooses the value of the accuracy parameter ρ = 0.01. Next, she uses Equation 13 to determine the value of the probabilistic tolerance, α, for the sample size n. For instance, if the data steward is not allowed to access more than 15,000 samples, for an accuracy of 0.01 the probabilistic tolerance is 0.9.
α = 1 − 2 e^{−2ρ²n}   (13)
She constructs an empirical cumulative distribution over the sensitivities as described in Section 4.2. Such an empirical cumulative distribution is shown in Figure 7. Using the computed probabilistic tolerance and the desired confidence level γ̂2, she uses the equation in Theorem 5 to determine γ2. She computes the sampled sensitivity ∆Sf using the empirical distribution function and the confidence level for privacy at risk γ2. For instance, using the empirical cumulative distribution in Figure 7, she calculates the value of the sampled sensitivity to be approximately 0.001 for γ2 = 0.4 and approximately 0.01 for γ2 = 0.85. Thus, she uses the privacy level ε, sets the number of samples to be n and computes the sampled sensitivity ∆Sf to calibrate the Laplace mechanism.

Fig. 7. Empirical cumulative distribution of the sensitivities of ridge regression queries, constructed using 15000 samples of neighbouring datasets.

The Case of Explicit and Implicit Randomness (cf. Section 4.3). In this scenario, the data steward does not know the sensitivity of the ridge regression. She is not allowed to sample more than n times from a population dataset. For a given confidence level γ2 and privacy at risk ε, she calibrates the Laplace mechanism using the illustration for Section 4.3. The privacy level in this calibration yields a utility loss that is more than her requirement. Therefore, she wants to re-calibrate the Laplace mechanism in order to reduce the utility loss.
For the re-calibration, the data steward uses the privacy level of the pre-calibrated Laplace mechanism, i.e. ε, as the privacy at risk level, and she provides a new confidence level for the empirical privacy at risk γ̂3. Using Equation 25 and Equation 23, she calculates:
γ̂3 P(T ≤ ηε0) − α γ2 P(T ≤ ε) = 0.
She solves this equation for ε0 using a root-finding technique such as the Newton-Raphson method [37]. For instance, if she needs to achieve a privacy at risk level ε = 0.4 with confidence levels γ̂3 = 0.9 and γ2 = 0.9, she can substitute these values, along with the values of the tolerance parameter and the sampled sensitivity used in the previous experiments, in the above equation. Solving the equation leads to the privacy level of noise ε0 = 0.8. Thus, she re-calibrates the Laplace mechanism with privacy level ε0, sets the number of samples to be n and uses the sampled sensitivity ∆Sf.
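The Case-1 computation above has a closed form when the query output is real-valued (k = 1), since P(T ≤ t) is then proportional to 1 − e^{−t}. The following sketch, with our own helper names, solves for ε0 and also evaluates the probabilistic tolerance of Equation 13.

```python
import math

def noise_privacy_level(eps, gamma1):
    """Solve gamma_1 * P(T <= eps0) = P(T <= eps) for eps0, assuming k = 1.

    Requires (1 - exp(-eps)) / gamma1 < 1, i.e. the target (eps, gamma1) must be achievable.
    """
    return -math.log(1.0 - (1.0 - math.exp(-eps)) / gamma1)

def probabilistic_tolerance(rho, n):
    """Equation 13: alpha = 1 - 2 * exp(-2 * rho**2 * n)."""
    return 1.0 - 2.0 * math.exp(-2.0 * rho**2 * n)

print(noise_privacy_level(eps=0.4, gamma1=0.6))      # ~ 0.80, as in the text
print(probabilistic_tolerance(rho=0.01, n=15000))    # ~ 0.90 for 15,000 samples
```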

∆Sf to calibrate the Laplace mechanism. The Case of Explicit and Implicit Random- Calibration of mechanisms. Researchers have pro- ness (cf. Section 4.3). In this scenario, the data stew- posed different privacy-preserving mechanisms to make ard does not know the sensitivity of ridge regression. She different queries differentially private. These mecha- is not allowed to sample more than n times from a pop- nisms can be broadly classified into two categories. In ulation dataset. For a given confidence level γ2 and the one category, the mechanisms explicitly add calibrated privacy at risk ε, she calibrates the Laplace mechanism noise, such as Laplace noise in the work of [11] or Gaus- using illustration for Section 4.3. The privacy level in sian noise in the work of [12], to the outputs of the query. this calibration yields utility loss that is more than her In the other category, [2, 6, 17, 41] propose mechanisms requirement. Therefore, she wants to re-calibrate the that alter the query function so that the modified func- Laplace mechanism in order to reduce utility loss. tion satisfies differentially privacy. Privacy-preserving For the re-calibration, the data steward uses pri- mechanisms in both of these categories perturb the orig- vacy level of the pre-calibrated Laplace mechanism, i.e. inal output of the query and make it difficult for a ma- ε, as the privacy at risk level and she provides a new licious data analyst to recover the original output of confidence level for empirical privacy at risk γˆ3. Using the query. These mechanisms induce randomness us- Differential Privacy at Risk 15 ing the explicit noise distribution. Calibration of these Around the same time of our work, Triastcyn et mechanisms require the knowledge of the sensitivity of al. [40] independently propose Bayesian differential pri- the query. Nissim et al. [32] consider the implicit ran- vacy that takes into account both of the sources of ran- domness in the data-generation distribution to compute domness. Despite this similarity, our works differ in mul- an estimate of the sensitivity. The authors propose the tiple dimensions. Firstly, they have shown the reduction smooth sensitivity function that is an envelope over the of their definition to a variant of Renyi differential pri- local sensitivities for all individual datasets. Local sen- vacy. The variant depends on the data-generation distri- sitivity of a dataset is the maximum change in the value bution. Secondly, they rely on the moment accountant of the query over all of its neighboring datasets. In gen- for the composition of the mechanisms. Lastly, they do eral, it is not easy to analytically estimate the smooth not provide a finer case-by-case analysis of the source of sensitivity function for a general query. Rubinstein et randomness, which leads to analytical solutions for the al. [38] also study the inherent randomness in the data- privacy guarantee. generation algorithm. We adopt their approach of sam- Kifer et al. [24] define Pufferfish privacy framework, pling the sensitivity from the empirical distribution of and its variant by Bassily et al. [4], that considers ran- the sensitivity. They use order statistics to choose a par- domness due to data-generation distribution as well as ticular value of the sensitivity. We use the risk, which noise distribution. 
7 Related Work

Calibration of mechanisms. Researchers have proposed different privacy-preserving mechanisms to make different queries differentially private. These mechanisms can be broadly classified into two categories. In one category, the mechanisms explicitly add calibrated noise, such as Laplace noise in the work of [11] or Gaussian noise in the work of [12], to the outputs of the query. In the other category, [2, 6, 17, 41] propose mechanisms that alter the query function so that the modified function satisfies differential privacy. Privacy-preserving mechanisms in both categories perturb the original output of the query and make it difficult for a malicious data analyst to recover the original output of the query. These mechanisms induce randomness using the explicit noise distribution. Calibration of these mechanisms requires knowledge of the sensitivity of the query. Nissim et al. [32] consider the implicit randomness in the data-generation distribution to compute an estimate of the sensitivity. The authors propose the smooth sensitivity function, an envelope over the local sensitivities of all individual datasets. The local sensitivity of a dataset is the maximum change in the value of the query over all of its neighbouring datasets. In general, it is not easy to analytically estimate the smooth sensitivity function for a general query. Rubinstein et al. [38] also study the inherent randomness in the data-generation algorithm. We adopt their approach of sampling the sensitivity from the empirical distribution of the sensitivity. They use order statistics to choose a particular value of the sensitivity. We instead use the risk, which provides a mediation tool for business entities to assess the actual business risks, on the sensitivity distribution to estimate the sensitivity.

Refinements of differential privacy. In order to account for both sources of randomness, refinements of ε-differential privacy have been proposed that bound the probability of occurrence of worst-case scenarios. Machanavajjhala et al. [27] propose probabilistic differential privacy, which considers upper bounds of the worst-case privacy loss for corresponding confidence levels on the noise distribution. The definition of probabilistic differential privacy incorporates the explicit randomness induced by the noise distribution and bounds the probability, over the space of noisy outputs, of satisfying the ε-differential privacy definition. Dwork et al. [13] propose concentrated differential privacy, which considers the expected value of the privacy loss random variable. The definition of concentrated differential privacy incorporates the explicit randomness induced by the noise distribution, but considering only the expected value of the privacy loss satisfying the ε-differential privacy definition, instead of using confidence levels, limits its scope. Hall et al. [16] propose random differential privacy, which considers the privacy loss for corresponding confidence levels on the implicit randomness in the data-generation distribution. The definition of random differential privacy incorporates the implicit randomness induced by the data-generation distribution and bounds the probability, over the space of datasets generated from the given distribution, of satisfying the ε-differential privacy definition. Dwork et al. [9] define approximate differential privacy by adding a constant bias to the privacy guarantee provided by differential privacy; it is not a probabilistic refinement of differential privacy.

Around the same time as our work, Triastcyn et al. [40] independently propose Bayesian differential privacy, which takes into account both sources of randomness. Despite this similarity, our works differ in multiple dimensions. Firstly, they show a reduction of their definition to a variant of Rényi differential privacy; the variant depends on the data-generation distribution. Secondly, they rely on the moments accountant for the composition of mechanisms. Lastly, they do not provide a finer case-by-case analysis of the sources of randomness, which in our work leads to analytical solutions for the privacy guarantee.

Kifer et al. [24] define the Pufferfish privacy framework, with a variant by Bassily et al. [4], which considers randomness due to the data-generation distribution as well as the noise distribution. Despite the generality of their approach, the framework relies on a domain expert to define the set of secrets that they want to protect.

We refer interested readers to [8] for an extensive review of differential privacy and its refinements.

Composition theorem. The recently proposed technique of the moments accountant [1] has become the state of the art for composing mechanisms in the area of privacy-preserving machine learning. Abadi et al. show that the moments accountant provides much stronger privacy guarantees than the conventional composition mechanisms. It works by keeping track of various moments of the privacy loss random variable and using bounds on them to provide privacy guarantees. The moments accountant requires access to the data-generation distribution to compute the bounds on the moments. Hence, the privacy guarantees are specific to the dataset.

Cost models. [7, 15] propose game-theoretic methods that provide the means to evaluate the monetary cost of differential privacy. Pejó et al. [36] also propose a game-theoretic cost model in the setting of private collaborative learning. Our approach is inspired by the approach in the work of Hsu et al. [19]. They model the cost under the scenario of a research study wherein the participants are reimbursed for their participation. Our cost modelling is driven by the scenario of securing a compensation budget in compliance with GDPR. Our requirement differs from the requirements of [19]; in particular, in our model the participants do not have any monetary incentive to share their data.

8 Conclusion and Future Works

In this paper, we provide a means to fine-tune the privacy level of a privacy-preserving mechanism by analysing the various sources of randomness. Such a fine-tuning leads to a probabilistic quantification of privacy levels with quantified risks, which we call privacy at risk. We also provide a composition theorem that leverages privacy at risk, and we analytically calculate privacy at risk for the Laplace mechanism. We propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR-compliant business entity. Convexity of the cost function ensures the existence of a unique privacy at risk level that minimises the compensation budget. The cost model not only reinforces the ease of application in a business setting but also provides stronger privacy guarantees on the composition of mechanisms.

It is possible to instantiate privacy at risk for the Gaussian mechanism. The mechanism is (ε, δ)-differentially private for γ = 0 and a non-zero risk calculated by accounting for the sources of randomness. We save it for future work. Privacy at risk may be fully analytically computed in cases where the data-generation or the sensitivity distribution, the noise distribution, and the query are analytically known and take convenient forms. We are now looking at such convenient but realistic cases.

Acknowledgement

This work was supported by the National Research Foundation (NRF) Singapore under its Corporate Laboratory@University Scheme, National University of Singapore, and Singapore Telecommunications Ltd. This research was also funded in part by the BioQOP project of the French ANR (ANR-17-CE39-0006). We thank Pierre Senellart for his help in reviewing the derivations of privacy at risk for the Laplace mechanism.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
[2] Gergely Acs, Claude Castelluccia, and Rui Chen. Differentially private histogram publishing through lossy compression. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 1–10. IEEE, 2012.
[3] RA Askey and AB Olde Daalhuis. Generalized hypergeometric functions and Meijer G-function. NIST Handbook of Mathematical Functions, pages 403–418, 2010.
[4] Raef Bassily, Adam Groce, Jonathan Katz, and Adam Smith. Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 439–448. IEEE, 2013.
[5] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. CoRR, 2016.
[6] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
[7] Yiling Chen, Stephen Chong, Ian A Kash, Tal Moran, and Salil Vadhan. Truthful mechanisms for agents that value privacy. ACM Transactions on Economics and Computation (TEAC), 4(3):13, 2016.
[8] Damien Desfontaines and Balázs Pejó. SoK: Differential privacies. Proceedings on Privacy Enhancing Technologies, 2020(2):288–313, 2020.
[9] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Eurocrypt, volume 4004, pages 486–503. Springer, 2006.
[10] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[11] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis, pages 265–284. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
[12] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[13] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.
[14] Simson L Garfinkel, John M Abowd, and Sarah Powazek. Issues encountered deploying differential privacy. arXiv preprint arXiv:1809.02201, 2018.
[15] Arpita Ghosh and Aaron Roth. Selling privacy at auction. Games and Economic Behavior, 91:334–346, 2015.
[16] Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Random differential privacy. Journal of Privacy and Confidentiality, 4(2):43–59, 2012.
[17] Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(Feb):703–727, 2013.
[18] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[19] Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C Pierce, and Aaron Roth. Differential privacy: An economic method for choosing epsilon. In Computer Security Foundations Symposium (CSF), 2014 IEEE 27th, pages 398–410. IEEE, 2014.

[20] Wolfram Research, Inc. Mathematica, Version 10. Champaign, IL, 2014.
[21] Philippe Jorion. Value at Risk: The New Benchmark for Managing Financial Risk. 2000.
[22] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning, pages 1376–1385, 2015.
[23] Daniel Kifer and Bing-Rong Lin. An axiomatic view of statistical privacy and utility. Journal of Privacy and Confidentiality, 4(1), 2012.
[24] Daniel Kifer and Ashwin Machanavajjhala. A rigorous and customizable framework for privacy. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 77–88. ACM, 2012.
[25] Jaewoo Lee and Chris Clifton. How much is enough? Choosing ε for differential privacy. In International Conference on Information Security, pages 325–340. Springer, 2011.
[26] Katrina Ligett, Seth Neel, Aaron Roth, Bo Waggoner, and Steven Z Wu. Accuracy first: Selecting a differential privacy level for accuracy constrained ERM. In Advances in Neural Information Processing Systems, pages 2563–2573, 2017.
[27] Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: Theory meets practice on the map. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 277–286. IEEE, 2008.
[28] Pascal Massart et al. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.
[29] Sebastian Meiser. Approximate and probabilistic differential privacy definitions. IACR Cryptology ePrint Archive, 2018:277, 2018.
[30] James P Moriarty, Megan E Branda, Kerry D Olsen, Nilay D Shah, Bijan J Borah, Amy E Wagie, Jason S Egginton, and James M Naessens. The effects of incremental costs of smoking and obesity on health care costs among adults: A 7-year longitudinal study. Journal of Occupational and Environmental Medicine, 54(3):286–291, 2012.
[31] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[32] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 75–84. ACM, 2007.
[33] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian J. Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[34] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. CoRR, abs/1802.08908, 2018.
[35] Athanasios Papoulis and S Unnikrishna Pillai. Probability, Random Variables, and Stochastic Processes. Tata McGraw-Hill Education, 2002.
[36] Balazs Pejo, Qiang Tang, and Gergely Biczok. Together or alone: The price of privacy in collaborative learning. Proceedings on Privacy Enhancing Technologies, 2019(2):47–65, 2019.
[37] William H Press. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[38] Benjamin IP Rubinstein and Francesco Aldà. Pain-free random differential privacy with sensitivity sampling. In International Conference on Machine Learning, pages 2950–2959, 2017.
[39] Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset], 2015.
[40] Aleksei Triastcyn and Boi Faltings. Federated learning with Bayesian differential privacy. arXiv preprint arXiv:1911.10071, 2019.
[41] Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Marianne Winslett. Functional mechanism: Regression analysis under differential privacy. Proceedings of the VLDB Endowment, 5(11):1364–1375, 2012.

A Proof of Theorem 2 (Section 3)

Proof. Let us fix a pair of neighbouring datasets x and y, and also fix a subset of outputs Z ⊆ Range(M). Choose

OUT ≜ \left\{ z ∈ Z : \frac{P(M(f, Θ)(x) = z)}{P(M(f, Θ)(y) = z)} > e^ε \right\}

Thus, we also obtain a complementary set of outputs Z \ OUT on which the privacy loss is upper bounded, i.e. P(M(f, Θ)(x) = z) ≤ e^ε P(M(f, Θ)(y) = z). Thus,

P(M(f, Θ)(x) ∈ Z)
 = P(M(f, Θ)(x) ∈ Z \ OUT) + P(M(f, Θ)(x) ∈ OUT)
 ≤_{(a)} e^ε P(M(f, Θ)(y) ∈ Z \ OUT) + P(M(f, Θ)(x) ∈ OUT)
 ≤_{(b)} e^ε P(M(f, Θ)(y) ∈ Z \ OUT) + γ
 ≤ e^ε P(M(f, Θ)(y) ∈ Z) + γ

(a) is obtained from the fact that the privacy loss is upper bounded on the complementary set of outputs Z \ OUT. (b) is a consequence of the definition of (ε, γ)-privacy at risk: the definition upper bounds the probability measure of the subset OUT by γ.

B Proof of Theorem 4 (Section 4.1)

Although a Laplace mechanism L^{∆f}_{ε} induces a higher amount of noise on average than a Laplace mechanism L^{∆f}_{ε0} for ε < ε0, there is a non-zero probability that L^{∆f}_{ε} induces noise commensurate to L^{∆f}_{ε0}. This non-zero probability guides us to calculate the privacy at risk γ1 for the privacy at risk level ε. In order to get an intuition, we illustrate the calculation of the overlap between two Laplace distributions as an estimator of the similarity between the two distributions.

Definition 8 (Overlap of Distributions, [35]). The overlap O between two probability densities P1, P2 with support X is defined as

O = \int_X \min[P_1(x), P_2(x)]\, dx.

Lemma 5. The overlap O between two probability distributions Lap(∆f/ε1) and Lap(∆f/ε2), such that ε2 ≤ ε1, is given by

O = 1 − \left(\exp(−µε_2/∆_f) − \exp(−µε_1/∆_f)\right), where µ = \frac{∆_f \ln(ε_1/ε_2)}{ε_1 − ε_2}.

Using the result in Lemma 5, we note that the overlap between two distributions with ε1 = 1 and ε2 = 0.6 is 0.81. Thus, L^{∆f}_{0.6} induces noise that is more than 80% similar to the noise induced by L^{∆f}_{1.0}. Therefore, we can loosely say that at least 80% of the time a Laplace mechanism L^{∆f}_{1.0} will provide the same privacy as a Laplace mechanism L^{∆f}_{0.6}.
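A small numerical check of Lemma 5, assuming nothing beyond the lemma itself: it integrates the pointwise minimum of the two Laplace densities and compares the result with the closed form for ε1 = 1, ε2 = 0.6 and ∆f = 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import laplace

def overlap_closed_form(eps1, eps2, delta_f=1.0):
    """O = 1 - (exp(-mu*eps2/delta_f) - exp(-mu*eps1/delta_f)), mu = delta_f*ln(eps1/eps2)/(eps1-eps2)."""
    mu = delta_f * np.log(eps1 / eps2) / (eps1 - eps2)
    return 1.0 - (np.exp(-mu * eps2 / delta_f) - np.exp(-mu * eps1 / delta_f))

def overlap_numeric(eps1, eps2, delta_f=1.0):
    """Integrate the pointwise minimum of the Lap(delta_f/eps1) and Lap(delta_f/eps2) densities."""
    pdf_min = lambda x: min(laplace.pdf(x, scale=delta_f / eps1),
                            laplace.pdf(x, scale=delta_f / eps2))
    value, _ = quad(pdf_min, -np.inf, np.inf)
    return value

print(overlap_closed_form(1.0, 0.6))   # ~0.81
print(overlap_numeric(1.0, 0.6))       # agrees up to numerical integration error
```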

Although the overlap between Laplace distributions with different scales offers an insight into the relationship between different privacy levels, it does not capture the constraint induced by the sensitivity. For a given query f, the amount of noise required to satisfy differential privacy is commensurate with the sensitivity of the query. This calibration puts a constraint on the noise that is induced on a pair of neighbouring datasets. We state this constraint in Lemma 6, which we further use to prove that the Laplace mechanism L^{∆f}_{ε0} satisfies (ε, γ1)-privacy at risk.

Lemma 6. For a Laplace mechanism L^{∆f}_{ε0}, the difference in the absolute values of the noise induced on a pair of neighbouring datasets is upper bounded by the sensitivity of the query.

Proof. Suppose that two neighbouring datasets x and y are given as input to a numeric query f : D → R^k. For any output z ∈ R^k of the Laplace mechanism L^{∆f}_{ε0},

\sum_{i=1}^{k} \left(|f(y)_i − z_i| − |f(x)_i − z_i|\right) ≤ \sum_{i=1}^{k} |f(x)_i − f(y)_i| ≤ ∆_f.

We use the triangle inequality in the first step and Definition 2 of the sensitivity in the second step.

We write Exp(b) to denote a random variable sampled from an exponential distribution with scale b > 0. We write Gamma(k, θ) to denote a random variable sampled from a gamma distribution with shape k > 0 and scale θ > 0.

Lemma 7 ([35]). If a random variable X follows the Laplace distribution with mean zero and scale b, then |X| ∼ Exp(b).

Lemma 8 ([35]). If X1, ..., Xn are n i.i.d. random variables, each following the exponential distribution with scale b, then \sum_{i=1}^{n} X_i ∼ Gamma(n, b).
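A quick sampling sanity check of Lemmas 7 and 8 (an illustration, not part of the proof): the absolute value of a centred Laplace sample should be indistinguishable from an exponential with the same scale, and the sum of k such absolute values from a Gamma(k, b).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
b, k, n = 2.0, 5, 200_000

abs_laplace = np.abs(rng.laplace(loc=0.0, scale=b, size=n))
print(stats.kstest(abs_laplace, stats.expon(scale=b).cdf).statistic)      # small => |X| ~ Exp(b)

sum_of_abs = np.abs(rng.laplace(loc=0.0, scale=b, size=(n, k))).sum(axis=1)
print(stats.kstest(sum_of_abs, stats.gamma(a=k, scale=b).cdf).statistic)  # small => sum ~ Gamma(k, b)
```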

Lemma 9. If X1 and X2 are two i.i.d. Gamma(n, θ) random variables, the probability density function of the random variable T = |X1 − X2|/θ is given by

P_T(t; n, θ) = \frac{2^{2−n}\, t^{n−\frac{1}{2}}\, K_{n−\frac{1}{2}}(t)}{\sqrt{2π}\, Γ(n)},

where K_{n−\frac{1}{2}} is the modified Bessel function of the second kind.

Proof. Let X1 and X2 be two i.i.d. Gamma(n, θ) random variables. The characteristic function of a Gamma(n, θ) random variable is given by

φ_{X_1}(z) = φ_{X_2}(z) = (1 − ιzθ)^{−n}.

Therefore,

φ_{X_1−X_2}(z) = φ_{X_1}(z)\, φ^{*}_{X_2}(z) = \frac{1}{(1 + (zθ)^2)^n}.

The probability density function of the random variable X1 − X2 is given by

P_{X_1−X_2}(x) = \frac{1}{2π} \int_{−∞}^{∞} e^{−ιzx}\, φ_{X_1−X_2}(z)\, dz = \frac{2^{1−n}}{\sqrt{2π}\, Γ(n)\, θ} \left|\frac{x}{θ}\right|^{n−\frac{1}{2}} K_{n−\frac{1}{2}}\!\left(\left|\frac{x}{θ}\right|\right),

where K_{n−\frac{1}{2}} is the modified Bessel function of the second kind. Let T = |X1 − X2|/θ. Therefore,

P_T(t; n, θ) = \frac{2^{2−n}\, t^{n−\frac{1}{2}}\, K_{n−\frac{1}{2}}(t)}{\sqrt{2π}\, Γ(n)}.

We use Mathematica [20] to solve the above integral.

Lemma 10. If X1 and X2 are two i.i.d. Gamma(n, θ) random variables and |X1 − X2| ≤ M, then T = |X1 − X2|/θ follows the distribution with probability density function

P_{T'}(t; n, θ, M) = \frac{P_T(t; n, θ)}{P(T ≤ M)},

where P_T is the probability density function defined in Lemma 9.

Lemma 11. For a Laplace mechanism L^{∆f}_{ε0} with query f : D → R^k, for any output Z ⊆ Range(L^{∆f}_{ε0}) and ε ≤ ε0,

P\left[\ln\frac{P(L^{∆f}_{ε0}(x) ∈ Z)}{P(L^{∆f}_{ε0}(y) ∈ Z)} ≤ ε\right] = \frac{P(T ≤ ε)}{P(T ≤ ε_0)},

where T follows the distribution P_T(t; k, ∆f/ε0) in Lemma 9.

Proof. Let x ∈ D and y ∈ D be two datasets such that x ∼ y. Let f : D → R^k be a numeric query. Let P_x(z) and P_y(z) denote the probabilities of getting the output z for the Laplace mechanisms L^{∆f}_{ε0}(x) and L^{∆f}_{ε0}(y) respectively. For any point z ∈ R^k and ε ≠ 0,

\frac{P_x(z)}{P_y(z)} = \prod_{i=1}^{k} \frac{\exp(−ε_0 |f(x)_i − z_i| / ∆_f)}{\exp(−ε_0 |f(y)_i − z_i| / ∆_f)} = \exp\left(\frac{ε_0}{∆_f} \sum_{i=1}^{k} \left(|f(y)_i − z_i| − |f(x)_i − z_i|\right)\right).   (14)

By Definition 4,

(f(x) − z), (f(y) − z) ∼ Lap(∆_f/ε_0).   (15)

Application of Lemma 7 and Lemma 8 yields

\sum_{i=1}^{k} |f(x)_i − z_i| ∼ Gamma(k, ∆_f/ε_0).   (16)

Using Equations 15 and 16 together with Lemmas 6 and 10, we get

\frac{ε_0}{∆_f} \sum_{i=1}^{k} \big| |f(y)_i − z_i| − |f(x)_i − z_i| \big| ∼ P_{T'}(t; k, ∆_f/ε_0, ∆_f),   (17)

since \sum_{i=1}^{k} \big| |f(y)_i − z_i| − |f(x)_i − z_i| \big| ≤ ∆_f. Therefore,

P\left[\frac{ε_0}{∆_f} \sum_{i=1}^{k} \big| |f(y)_i − z_i| − |f(x)_i − z_i| \big| ≤ ε\right] = \frac{P(T ≤ ε)}{P(T ≤ ε_0)},   (18)

where T follows the distribution in Lemma 9. We use Mathematica [20] to analytically compute

P(T ≤ x) ∝ \sqrt{π}\, 4^{k} x\, {}_1F_2\!\left(\tfrac{1}{2}; \tfrac{3}{2} − k, \tfrac{3}{2}; \tfrac{x^2}{4}\right) − 2\, x^{2k}\, Γ(k)\, {}_1F_2\!\left(k; \tfrac{1}{2} + k, k + 1; \tfrac{x^2}{4}\right),

where {}_1F_2 is the regularised generalised hypergeometric function as defined in [3]. From Equations 14 and 18,

γ_1 ≜ P\left[\ln\frac{P(L^{∆f}_{ε0}(x) ∈ S)}{P(L^{∆f}_{ε0}(y) ∈ S)} ≤ ε\right] = \frac{P(T ≤ ε)}{P(T ≤ ε_0)}.

This completes the proof of Theorem 4.

Corollary 3. A Laplace mechanism L^{∆f}_{ε0} with f : D → R^k satisfies (ε, δ)-probabilistic differential privacy with

δ = 1 − \frac{P(T ≤ ε)}{P(T ≤ ε_0)} if ε ≤ ε_0, and δ = 0 if ε > ε_0,

where T follows BesselK(k, ∆f/ε0).
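A hedged numerical sketch of the quantities above (an illustration, not the authors' code): γ1 = P(T ≤ ε)/P(T ≤ ε0) is estimated by Monte Carlo using T = |X1 − X2| with X1, X2 i.i.d. Gamma(k, 1) from Lemma 9, since the scale ∆f/ε0 cancels in T; the dimension k and the privacy levels are assumed values.

```python
import numpy as np

def privacy_at_risk_gamma1(eps, eps0, k, n_mc=1_000_000, seed=2):
    """Monte-Carlo estimate of gamma_1 = P(T <= eps) / P(T <= eps0), T = |X1 - X2|, Xi ~ Gamma(k, 1)."""
    rng = np.random.default_rng(seed)
    t = np.abs(rng.gamma(k, size=n_mc) - rng.gamma(k, size=n_mc))
    return np.mean(t <= eps) / np.mean(t <= eps0)

gamma1 = privacy_at_risk_gamma1(eps=0.6, eps0=1.0, k=1)
print(gamma1, 1.0 - gamma1)   # gamma_1 and the corresponding delta of Corollary 3
```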

 ∆S  Corollary 3. Laplace Mechanism L∆f with f : D → f ε0 = Fn(∆Sf )P Cρ k R satisfies (ε, δ)-probabilistic differentially private  −2ρ2n where ≥ γ2 · 1 − 2e (21) ( P(T ≤ε) 1 − T ≤ε ε ≤ ε0 δ = P( 0) In the last step, we use the definition of the sampled 0 ε > ε 0 sensitivity to get the value of the first term. The last and T follows BesselK(k, ∆f /ε0). term is obtained using DKW-inequality, as defined in [28], where the n denotes the number of samples used to build empirical distribution of the sensitivity, Fn. C Proof of Theorem 5 From Equation 19, we understand that if kf(y) − f(x)k1 is less than or equals to the sampled sensitivity ∆Sf (Section 4.2) then the Laplace mechanism Lε satisfies ε-differential privacy. Equation 21 provides the lower bound on the Proof. Let, x and y be any two neighbouring datasets probability of the event kf(y) − f(x)k1 ≤ ∆Sf . Thus, sampled from the data generating distribution G. Let, combining Equation 19 and Equation 21 completes the ∆ f : D → k Sf be the sampled sensitivity for query R . proof. Let, Px(z) and Py(z) denote the probabilities of get- ∆Sf ting the output z for Laplace mechanisms Lε (x) and ∆Sf k Lε (y) respectively. For any point z ∈ R and ε 6= 0, D Proof of Theorem 6   k exp −ε|f(xi)−zi| (Section 4.3) Px(z) Y ∆Sf =   y(z) −ε|f(yi)−zi| P i=1 exp ∆Sf Proof of Theorem 6 builds upon the ideas from the Pk ! proofs for the rest of the two cases. In addition to the ε i=1(|f(yi) − zi| − |f(xi) − zi|) = exp events defined in Equation 20, we define an additional ∆S f ∆S A f k ! event ε0 , defined in Equation 22, as a set of outputs P ∆S ε i |f(yi) − f(xi)| f ≤ exp =1 of Laplace mechanism Lε that satisfy the constraint ∆ 0 Sf of ε-differential privacy for a specified privacy at risk εkf(y) − f(x)k  level ε. = exp 1 (19) ∆Sf

( ∆Sf ) We used triangle inequality in the penultimate step. ∆S ∆S L (x) f f ε0 Aε0 , z ∼ Lε0 : ln ≤ ε, x, y ∼ G Using the trick in the work of [38], we define fol- ∆Sf ∆S Lε (y) lowing events. Let, B f denotes the set of pairs neigh- 0 (22) bouring dataset sampled from G for which the sensitivity ∆Sf random variable is upper bounded by ∆Sf . Let, Cρ Corollary 4. denotes the set of sensitivity random variable values for ∆Sf ∆S P(T ≤ ε) which Fn deviates from the unknown cumulative distri- (A |B f ) = P ε0 (T ≤ ηε ) bution of S, F , at most by the accuracy value ρ. These P 0 events are defined in Equation 20. where T follows the distribution PT (t; ∆Sf /ε0) in ∆f ∆S Lemma 9 and η = . f ∆S B , {x, y ∼ G such that kf(y) − f(x)k1 ≤ ∆Sf } f   ∆S f n Proof. Cρ , sup |FS (∆) − FS(∆)| ≤ ρ (20) We provide the sketch of the proof. Proof follows ∆ from the proof of Lemma 11. For a Laplace mechanism

Thus, we obtain calibrated with the sampled sensitivity ∆Sf and privacy ε level 0, Equation 17 translates to,  ∆S   ∆S ∆Sf   ∆Sf  f f P B = P B Cρ P Cρ k !     ε0 X ∆S ∆Sf ∆Sf + B f C C |(|f(yi) − z| − |f(xi) − z|)| ∼ P ρ P ρ ∆S f i=1 Differential Privacy at Risk 21
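For concreteness, a worked instance of the bound in Equation 21 under values borrowed from the illustration in Section 6 (n = 15000 samples, accuracy ρ = 0.01, confidence γ2 = 0.9):

P(B^{∆Sf}) ≥ γ_2 (1 − 2e^{−2ρ^2 n}) = 0.9 · (1 − 2e^{−3}) ≈ 0.9 · 0.9004 ≈ 0.81.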

D Proof of Theorem 6 (Section 4.3)

The proof of Theorem 6 builds upon the ideas from the proofs for the other two cases. In addition to the events defined in Equation 20, we define an additional event A^{∆Sf}_{ε0}, defined in Equation 22, as the set of outputs of the Laplace mechanism L^{∆Sf}_{ε0} that satisfy the constraint of ε-differential privacy for a specified privacy at risk level ε.

A^{∆Sf}_{ε0} ≜ \left\{ z ∼ L^{∆Sf}_{ε0} : \ln\frac{L^{∆Sf}_{ε0}(x)}{L^{∆Sf}_{ε0}(y)} ≤ ε,\; x, y ∼ G \right\}   (22)

Corollary 4.

P(A^{∆Sf}_{ε0} | B^{∆Sf}) = \frac{P(T ≤ ε)}{P(T ≤ ηε_0)},

where T follows the distribution P_T(t; ∆Sf/ε0) in Lemma 9 and η = ∆f/∆Sf.

Proof. We provide a sketch of the proof, which follows from the proof of Lemma 11. For a Laplace mechanism calibrated with the sampled sensitivity ∆Sf and privacy level ε0, Equation 17 translates to

\frac{ε_0}{∆_{S_f}} \sum_{i=1}^{k} \big| |f(y)_i − z_i| − |f(x)_i − z_i| \big| ∼ P_{T'}(t; k, ∆_{S_f}/ε_0, ∆_f),

since \sum_{i=1}^{k} \big| |f(y)_i − z_i| − |f(x)_i − z_i| \big| ≤ ∆_f. Using Lemma 10 and Equation 18,

P(A^{∆Sf}_{ε0} | B^{∆Sf}) = \frac{P(T ≤ ε)}{P(T ≤ ηε_0)},

where T follows the distribution P_T(t; ∆Sf/ε0) and η = ∆f/∆Sf.

For this case, we do not assume knowledge of the sensitivity of the query. Using the empirical estimation presented in Section 4.2, if we choose the sampled sensitivity for privacy at risk γ2 = 1, we obtain an approximation for η.

Lemma 12. For a given value of the accuracy parameter ρ,

\frac{∆_f}{∆_{S_f}} = 1 + O\!\left(\frac{ρ}{∆^{*}_{S_f}}\right),

where ∆^{*}_{S_f} = F_n^{−1}(1). O(ρ/∆^{*}_{S_f}) denotes the order of ρ/∆^{*}_{S_f}, i.e., O(ρ/∆^{*}_{S_f}) = k(ρ/∆^{*}_{S_f}) for some k ≥ 1.

Proof. For a given value of the accuracy parameter ρ and any ∆ > 0,

F_n(∆) − F(∆) ≤ ρ.

Since the above inequality is true for any value of ∆, let ∆ = F^{−1}(1). Therefore,

F_n(F^{−1}(1)) − F(F^{−1}(1)) ≤ ρ,
F_n(F^{−1}(1)) ≤ 1 + ρ.   (23)

Since a cumulative distribution function is 1-Lipschitz [35],

|F_n(F_n^{−1}(1)) − F_n(F^{−1}(1))| ≤ |F_n^{−1}(1) − F^{−1}(1)|
|F_n(F_n^{−1}(1)) − F_n(F^{−1}(1))| ≤ |∆^{*}_{S_f} − ∆_f|
ρ ≤ ∆_f − ∆^{*}_{S_f}
1 + \frac{ρ}{∆^{*}_{S_f}} ≤ \frac{∆_f}{∆^{*}_{S_f}},

where we used the result from Equation 23 in step 3. Introducing O(ρ/∆^{*}_{S_f}) completes the proof.

Lemma 13. For a Laplace mechanism L^{∆Sf}_{ε0} with sampled sensitivity ∆Sf of a query f : D → R^k and for any Z ⊆ Range(L^{∆Sf}_{ε0}),

P\left[\ln\frac{P(L^{∆Sf}_{ε0}(x) ∈ Z)}{P(L^{∆Sf}_{ε0}(y) ∈ Z)} ≤ ε\right] ≥ \frac{P(T ≤ ε)}{P(T ≤ ηε_0)} · γ_2 (1 − 2e^{−2ρ^2 n}),

where n is the number of samples used to find the sampled sensitivity, ρ ∈ [0, 1] is an accuracy parameter and η = ∆f/∆Sf. The outer probability is calculated with respect to the support of the data-generation distribution G.

Proof. The proof follows from the proof of Lemma 11 and Corollary 4. Consider

P(A^{∆Sf}_{ε0}) ≥ P(A^{∆Sf}_{ε0} | B^{∆Sf})\, P(B^{∆Sf} | C^{∆Sf}_ρ)\, P(C^{∆Sf}_ρ)
 ≥ \frac{P(T ≤ ε)}{P(T ≤ ηε_0)} · γ_2 · (1 − 2e^{−2ρ^2 n}).   (24)

The first term in the final step of Equation 24 follows from the result in Corollary 4, where T follows BesselK(k, ∆Sf/ε0). It is the probability with which the Laplace mechanism L^{∆Sf}_{ε0} satisfies ε-differential privacy for a given value of the sampled sensitivity.

The probability of occurrence of the event A^{∆Sf}_{ε0}, calculated by accounting for both the explicit and the implicit sources of randomness, gives the risk for privacy level ε. Thus, the proof of Lemma 13 completes the proof of Theorem 6.

Comparing the equations in Theorem 6 and Lemma 13, we observe that

γ_3 ≜ \frac{P(T ≤ ε)}{P(T ≤ ηε_0)} · γ_2.   (25)

The privacy at risk, as defined in Equation 25, is free from the term that accounts for the accuracy of the sampled estimate. If we know the cumulative distribution of the sensitivity, we do not suffer from the uncertainty introduced by sampling from the empirical distribution.