Answering Multi-Dimensional Range Queries under Local Differential Privacy

Jianyu Yang1,2∗, Tianhao Wang2, Ninghui Li2, Xiang Cheng1, Sen Su1 1State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China {jyyang, chengxiang, susen}@bupt.edu.cn 2Department of Computer Science, Purdue University, West Lafayette, USA {yang1896, tianhaowang}@purdue.edu, [email protected]

ABSTRACT

In this paper, we tackle the problem of answering multi-dimensional range queries under local differential privacy. There are three key technical challenges: capturing the correlations among attributes, avoiding the curse of dimensionality, and dealing with the large domains of attributes. None of the existing approaches satisfactorily deals with all three challenges. Overcoming these three challenges, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the two-dimensional (2-D) domains of all attribute pairs into 2-D grids that can answer all 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries. However, in order to reduce errors due to noises, coarse granularities are needed for each attribute in 2-D grids, losing fine-grained distribution information for individual attributes. To correct this deficiency, we further propose Hybrid-Dimensional Grids (HDG), which also introduces 1-D grids to capture finer-grained information on the distribution of each individual attribute and combines information from 1-D and 2-D grids to answer range queries. To make HDG consistently effective, we provide a guideline for properly choosing granularities of grids based on an analysis of how different sources of errors are impacted by these choices. Extensive experiments conducted on real and synthetic datasets show that HDG can give a significant improvement over the existing approaches.

PVLDB Reference Format:
Jianyu Yang, Tianhao Wang, Ninghui Li, Xiang Cheng, and Sen Su. Answering Multi-Dimensional Range Queries under Local Differential Privacy. PVLDB, 14(3): 378-390, 2021. doi:10.14778/3430915.3430927

∗ Work done while studying as a visiting student at Purdue University.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 3 ISSN 2150-8097. doi:10.14778/3430915.3430927

1 INTRODUCTION

Nowadays, users' data records contain many ordinal or numerical attributes in nature, e.g., income, age, the amount of time viewing a certain page, the number of times performing a certain action, etc. The domains of these attributes consist of values that have a meaningful total order. A typical kind of fundamental analysis over users' records is the multi-dimensional range query, which is a conjunction of multiple predicates for the attributes of interest and asks the fraction of users whose record satisfies all the predicates. In particular, a predicate is a restriction on the range of values of an attribute. However, users' records regarding these ordinal attributes are highly sensitive. Without a strong privacy guarantee, answering multi-dimensional range queries over them will put individual privacy in jeopardy. Thus, developing effective approaches to address such privacy concerns becomes an urgent need.

In recent years, local differential privacy (LDP) has come to be the de facto standard for individual privacy protection. Under LDP, random noise is added on the client side before the data is sent to the central server. Thus, users do not need to rely on the trustworthiness of the central server. This desirable feature of LDP has led to wide deployment by industry (e.g., by Google [16], Apple [40], and Microsoft [11]). However, existing LDP solutions [9, 29, 46] are mostly limited to one-dimensional (1-D) range queries on a single attribute and cannot be well extended to handle multi-dimensional range queries.

In this paper, we tackle the problem of answering multi-dimensional range queries under LDP. Given a large number of users who have a record including multiple ordinal attributes, an untrusted aggregator aims at answering all possible multi-dimensional range queries over the users' records while satisfying LDP. To address the problem, we identify three key technical challenges: 1) how to capture the correlations among attributes, 2) how to avoid the curse of dimensionality, and 3) how to cope with the large domains of attributes. Any approach failing to solve any of these three challenges will have poor utility. As we show in Section 3, none of the existing approaches or their extensions can deal with all three challenges at the same time.

Overcoming these three challenges, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the two-dimensional (2-D) domains of all attribute pairs into 2-D grids that can answer all possible 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries. However, in order to reduce errors due to noises, coarse granularities are needed for each attribute in 2-D grids, losing fine-grained distribution information for individual attributes. When computing the answer of a 2-D range query by the cells that are partially included in the query, it needs to assume a uniform distribution within these cells, which may lead to large errors. To correct this deficiency, we further propose an upgraded approach called Hybrid-Dimensional Grids (HDG).

HDG's core idea is combining hybrid dimensional (1-D and 2-D) information for better estimation. In particular, HDG also introduces 1-D grids to capture finer-grained information on the distribution of each individual attribute and combines information from 1-D and 2-D grids to answer range queries. In both TDG and HDG, users are divided into groups, where each group reports information for one grid. After collecting frequencies of cells in each grid under LDP, the aggregator uses techniques to remove negativity and inconsistency among grids, and finally employs these grids to answer range queries.

However, it is still nontrivial to make HDG consistently effective, since the granularities for 1-D and 2-D grids can directly affect the performance of HDG. Consequently, it is essential to develop a method for determining the appropriate grid granularities so that HDG can guarantee the desirable utility. In particular, in HDG there are two main sources of errors: those due to noises generated by the random nature of LDP and those due to binning. When the distribution of values is fixed, errors due to binning do not change and can be viewed as bias because of the uniformity assumption, and errors due to noises can be viewed as variance. Thus choosing the granularities of grids can be viewed as a form of bias-variance trade-off. Finer-grained grids lead to greater error due to noises, while coarser-grained ones result in greater error due to biases. The effect of each choice depends both on the privacy budget ε, the population, and the property of the distribution. By thoroughly analyzing the two sources of errors, we provide a guideline for properly choosing grid granularities under different parameter settings.

By capturing the necessary pair-wise attribute correlations via 2-D grids, both approaches overcome the first two challenges. Moreover, since they properly use binning with the provided guideline to reduce the error incurred by a large domain, the third challenge is carefully solved. Therefore, TDG usually performs better than the existing approaches. By also introducing 1-D grids to reduce the error due to the uniformity assumption, HDG can give a significant improvement over existing approaches.

Contributions. To summarize, this paper makes the following contributions:
• We propose TDG and HDG for answering multi-dimensional range queries under LDP, which include a guideline for choosing the grid granularities based on an analysis of errors from different sources.
• We conduct extensive experiments to evaluate the performance of different approaches using both real and synthetic datasets. The results show that HDG outperforms existing approaches by one order of magnitude.

Roadmap. Section 2 provides the preliminaries. Section 3 describes the problem statement and four baseline approaches. Section 4 gives the details of our grid approaches. Section 5 shows our experimental results. Section 6 reviews related work. Finally, Section 7 concludes this paper.

2 PRELIMINARIES

2.1 Local Differential Privacy

Local differential privacy (LDP) [26] offers a high level of privacy protection, since each user only reports the sanitized data. Each user's privacy is still protected even if the aggregator is malicious. In particular, each user perturbs the value v using a randomized algorithm A and reports A(v) to the aggregator. Formally, LDP is defined in the following.

Definition 1 (Local Differential Privacy). An algorithm A(·) satisfies ε-local differential privacy (ε-LDP), where ε ≥ 0, if and only if for any pair of inputs (v, v′), and any set R of possible outputs of A, we have

    Pr[A(v) ∈ R] ≤ e^ε · Pr[A(v′) ∈ R].

2.2 Categorical Frequency Oracles

In LDP, most problems can be reduced to frequency estimation. Below we present two state-of-the-art Categorical Frequency Oracle (CFO) protocols for these problems.

Randomized Response. The basic protocol in LDP is random response [49]. It was introduced for the binary case, but can be easily generalized to the categorical setting. Here we present the generalized version of random response (GRR), which enables the estimation of the frequency of any given value in a fixed domain. Here each user with value v ∈ [c] sends the true value v with probability p, and with probability 1 − p sends a randomly chosen v′ ∈ [c] s.t. v′ ≠ v. More formally, the perturbation function is defined as

    ∀y ∈ [c],  Pr[GRR(v) = y] = { p  = e^ε / (e^ε + c − 1),  if y = v
                                 { p′ = 1 / (e^ε + c − 1),   if y ≠ v     (1)

This satisfies ε-LDP since p/p′ = e^ε. To estimate the frequency f_v for v ∈ [c], one counts how many times v is reported, denoted by Σ_{i∈[n]} 1_{y_i=v}, and then computes f_v = (1/n) · Σ_{i∈[n]} (1_{y_i=v} − p′) / (p − p′), where 1_{y_i=v} is the indicator function that the report y_i of the i-th user equals v, and n is the total number of users.

Because each report y_i is an independent random variable, by the linearity of variance, we can show that the variance for this estimation is

    Var[f_v] = (c − 2 + e^ε) / ((e^ε − 1)^2 · n).     (2)

Optimized Local Hash. The optimized local hash (OLH) protocol deals with a large domain by first using a hash function to compress the input domain [c] into a smaller domain [c′], and then applying randomized response to the hashed value. In this protocol, both the hashing step and the randomization step result in information loss. The choice of the parameter c′ is a trade-off between loss of information during the hashing step and loss of information during the randomization step. It is shown in [45] that the estimation variance as a function of c′ is minimized when c′ = e^ε + 1.

In OLH, one reports ⟨H, GRR(H(v))⟩, where H is randomly chosen from a family of hash functions that hash each value in [c] to a new one in [c′], and GRR(·) is the perturbation function for random response, while operating on the domain [c′] (thus p = e^ε / (e^ε + c′ − 1) in Equation (1)). Let ⟨H_i, y_i⟩ be the report from the i-th user. For each value v ∈ [c], to compute its frequency, one first computes |{i | H_i(v) = y_i}| = Σ_{i∈[n]} 1_{H_i(v)=y_i}, and then transforms it to its unbiased estimation f_v = (1/n) · Σ_{i∈[n]} (1_{H_i(v)=y_i} − 1/c′) / (p − 1/c′).
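To make the GRR mechanics concrete, here is a minimal Python sketch of the perturbation in Equation (1) and the unbiased estimator above (the 0-based domain encoding, function names, and simulation setup are our own illustration, not part of the paper):

```python
import math
import random

def grr_perturb(v, c, eps, rng):
    """GRR: report the true value v in {0, ..., c-1} with probability p,
    otherwise report one of the other c-1 values uniformly at random."""
    p = math.exp(eps) / (math.exp(eps) + c - 1)
    if rng.random() < p:
        return v
    y = rng.randrange(c - 1)          # uniform over the other c-1 values
    return y if y < v else y + 1

def grr_frequency(reports, v, c, eps):
    """Unbiased estimate of the frequency of v from GRR reports,
    following f_v = (1/n) * sum((1[y_i = v] - p') / (p - p'))."""
    p = math.exp(eps) / (math.exp(eps) + c - 1)
    q = 1.0 / (math.exp(eps) + c - 1)   # p' in Equation (1)
    n = len(reports)
    count = sum(1 for y in reports if y == v)
    return (count / n - q) / (p - q)

# Simulate n users; value 0 has true frequency 0.3.
rng = random.Random(42)
n, c, eps = 200_000, 8, 1.0
true = [0 if i < 0.3 * n else 1 + (i % (c - 1)) for i in range(n)]
reports = [grr_perturb(v, c, eps, rng) for v in true]
est = grr_frequency(reports, 0, c, eps)
```

With these parameters the standard deviation predicted by Equation (2) is about 0.004, so the estimate lands close to 0.3.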

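OLH can be sketched in the same style. This is a simplified simulation only: we stand in for the random hash family H with a per-user seeded SHA-256 hash and round c′ = e^ε + 1 to an integer; both are assumptions of this sketch rather than details fixed by [45]:

```python
import hashlib
import math
import random

def olh_hash(seed, v, cp):
    """Hash v into [0, cp); the per-user seed plays the role of the
    randomly chosen hash function H_i (an assumption of this sketch)."""
    digest = hashlib.sha256(f"{seed}:{v}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % cp

def olh_report(v, eps, seed, rng):
    """Perturb H(v) with GRR over the compressed domain [0, cp)."""
    cp = int(round(math.exp(eps))) + 1            # c' ~ e^eps + 1
    p = math.exp(eps) / (math.exp(eps) + cp - 1)  # p in Equation (1)
    x = olh_hash(seed, v, cp)
    if rng.random() < p:
        return x
    y = rng.randrange(cp - 1)
    return y if y < x else y + 1

def olh_frequency(reports, v, c, eps):
    """Unbiased estimate: count users whose report matches H_i(v)."""
    cp = int(round(math.exp(eps))) + 1
    p = math.exp(eps) / (math.exp(eps) + cp - 1)
    n = len(reports)
    support = sum(1 for seed, y in reports if olh_hash(seed, v, cp) == y)
    return (support / n - 1 / cp) / (p - 1 / cp)

rng = random.Random(7)
n, c, eps = 200_000, 1024, 1.0
true = [3 if i < 0.4 * n else rng.randrange(4, c) for i in range(n)]
reports = [(i, olh_report(v, eps, i, rng)) for i, v in enumerate(true)]
est = olh_frequency(reports, 3, c, eps)
```

Note that the estimator never enumerates the large domain [c] during reporting; only the aggregator evaluates H_i(v) for the values it cares about.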
In [45], it is shown that the estimation variance of OLH is

    Var[f_v] = 4e^ε / ((e^ε − 1)^2 · n).     (3)

Compared with GRR, OLH has a variance that does not depend on c. As a result, for a small c (such that c − 2 < 3e^ε), GRR is better; but for a large c, OLH is preferable.

2.3 Principle of Dividing Users

Dividing users is one common feature among the existing LDP works [9, 46, 56]. That is, when multiple pieces of information are needed, the best results are obtained by dividing users into groups, and then gathering information from each group. This is different from the traditional DP setting [14], where there is a trusted aggregator having access to the raw data records. In the DP setting, the privacy budget is split to measure them all. This is because the estimation variance in the LDP setting is linear in the number of users, while in the DP setting, it is a constant. As a result, dividing users into m groups incurs an m^2 multiplicative factor in the DP setting (because the result is multiplied by m), while in the LDP setting, this factor is only m (because the number of users is divided by m). As splitting the privacy budget by m increases variances for both cases by m^2, one prefers dividing users in the LDP setting while splitting the privacy budget in the DP setting (as there is no sampling error). We will also apply this principle of dividing users in our proposed approaches.

3 PROBLEM STATEMENT AND BASELINE APPROACHES

3.1 Problem Statement

Consider there are d ordinal attributes {a_1, a_2, ···, a_d}. Without loss of generality, we assume that all attributes have the same domain [c] = {1, 2, ..., c}, where c is a power of two (if not in the real setting, we can simply add some dummy values to achieve it). Let n be the total number of users. The i-th user's record is a d-dimensional vector, denoted by v_i = ⟨v_i^1, v_i^2, ..., v_i^d⟩, where v_i^t means the value of attribute a_t in record v_i.

We focus on the problem of answering multi-dimensional range queries under LDP. In particular, a multi-dimensional range query is a conjunction of multiple predicates for the attributes in its interest. Formally, a λ-dimensional (λ-D) range query q is defined as

    q = (a_{t_1}, [l_{t_1}, r_{t_1}]) ∧ (a_{t_2}, [l_{t_2}, r_{t_2}]) ∧ ··· ∧ (a_{t_λ}, [l_{t_λ}, r_{t_λ}]),

where 1 ≤ t_ϕ ≤ d, and t_ϕ ≠ t_ψ when ϕ ≠ ψ. We define A_q to be {a_{t_ϕ} | 1 ≤ ϕ ≤ λ}, representing the set of attributes in q's interest. Intuitively, such a query q selects all records whose value of attribute a_{t_ϕ} is in the interval [l_{t_ϕ}, r_{t_ϕ}] for all a_{t_ϕ} ∈ A_q. The answer of the query q equals the fraction of these selected records. In particular, the real answer of q can be represented as

    f̄_q = |{v_i | v_i^t ∈ [l_t, r_t], ∀a_t ∈ A_q}| / n.

In our problem setting, we assume that there is an aggregator that does not have access to the users' raw records. Our goal is to design an approach to enable the aggregator to get the answers of all possible range queries from the n users while satisfying LDP. Please see Table 1 for the list of notations.

Table 1: Notations

    Notation   Meaning
    n          The total number of users
    d          The number of attributes
    c          The domain size of an attribute
    b          The branching factor of a hierarchy
    m          The number of user groups
    g          The granularity for an ordinal domain
    q          The range query
    A_q        The set of attributes in q's interest
    λ          The query dimension

Key Technical Challenges. To address this problem, we identify three key technical challenges: 1) capturing the correlations among attributes, 2) avoiding the curse of dimensionality, and 3) coping with the large domains of attributes. Failure to solve any of these three challenges will lead to poor utility of the results.

In the following, we will describe four baseline approaches that may handle the problem of answering multi-dimensional range queries under LDP and analyze how they deal with these challenges. In particular, the first two approaches, CALM and HIO, are existing approaches that can be directly applied to this problem. The third approach, Low-dimensional HIO (LHIO), is an improvement of HIO. The last approach, Multiplied Square Wave (MSW), is an extension of an existing approach for answering 1-D range queries.

3.2 CALM

CALM [56] is the state-of-the-art for marginal release under LDP. In particular, a λ-D marginal means the joint distribution of λ attributes. Due to the curse of dimensionality, directly computing a high-dimensional marginal using an LDP frequency oracle will lead to too much added noise. To solve this problem, CALM proposes to collect low-dimensional marginals and reconstruct a high-dimensional marginal from them. We notice that CALM can be used to answer range queries. In particular, for a λ-D range query, one can employ CALM to get its answer by directly summing up the noisy marginals included in the query.

CALM only captures necessary pair-wise attribute correlations, which effectively overcomes the first two challenges. However, it fails to solve the third challenge. To answer a range query, CALM needs to sum up all noisy marginals in the query, which may result in a large amount of noise in the answer when c is relatively large.

3.3 HIO

HIO [46] is a hierarchy-based approach that can directly answer multi-dimensional range queries under LDP. In HIO, given d attributes with the domain [c], the aggregator first constructs a 1-D hierarchy for each attribute. To be specific, a 1-D hierarchy is a hierarchical collection of intervals with a branching factor b. The root corresponds to the entire domain [c] and is recursively partitioned into b equally sized subintervals until the leaves whose corresponding subintervals only contain one value are reached.
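To show how such a hierarchy is used when answering queries, here is a sketch (our own recursive formulation, not code from [46]) of finding the least number of hierarchy subintervals that together cover a query interval [l, r]; hierarchy-based approaches perform this step once per attribute:

```python
def decompose(l, r, lo, hi, b):
    """Return the minimal set of b-ary hierarchy intervals covering
    [l, r], where [lo, hi] is the current node's interval and its
    children split [lo, hi] into b equal parts (all bounds inclusive)."""
    if l > hi or r < lo:
        return []                     # node disjoint from the query
    if l <= lo and hi <= r:
        return [(lo, hi)]             # node fully inside the query
    width = (hi - lo + 1) // b
    out = []
    for i in range(b):
        child_lo = lo + i * width
        out += decompose(l, r, child_lo, child_lo + width - 1, b)
    return out

# Domain [1, 16] with branching factor b = 2 (so h = log2(16) = 4).
parts = decompose(3, 11, 1, 16, 2)
# parts covers [3, 11] with four hierarchy intervals instead of nine leaves.
```

The noisy frequencies of exactly these intervals are then summed, which is why a hierarchy answers a range query with far fewer noisy terms than summing individual values.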

Thus there are h = log_b c + 1 levels, called one-dim levels, in a 1-D hierarchy. By defining that the root has level 0, there are b^ℓ subintervals in a level ℓ ∈ [0, h]. It is found in [46] that the optimal b is around 5. For illustration, we define a d-dim level as a group of d one-dim levels (ℓ_1, ℓ_2, ..., ℓ_d), each of which comes from one of these d 1-D hierarchies. Similarly, we define a d-dim interval as a group of d intervals, each of which also comes from one of these d 1-D hierarchies.

Then, the aggregator constructs a d-dimensional hierarchy with these d 1-D hierarchies. A level in the d-dimensional hierarchy is actually a d-dim level. Thus there are (h + 1)^d d-dim levels in the d-dimensional hierarchy. Since there are b^ℓ subintervals in a one-dim level ℓ in a 1-D hierarchy, a d-dim level (ℓ_1, ℓ_2, ..., ℓ_d) includes ∏_{i=1}^d b^{ℓ_i} d-dim intervals. Next, the aggregator randomly divides users into (h + 1)^d groups, where each group reports one d-dim level. After using OLH to get the noisy frequencies of all d-dim intervals in every d-dim level, the aggregator can answer a multi-dimensional range query in the following manner.

To answer a λ-D range query

    q = (a_{t_1}, [l_{t_1}, r_{t_1}]) ∧ (a_{t_2}, [l_{t_2}, r_{t_2}]) ∧ ··· ∧ (a_{t_λ}, [l_{t_λ}, r_{t_λ}]),

the aggregator first expands q to a new d-dimensional range query q′ that is interested in all d attributes by assigning a specified interval [1, c] to each attribute not in A_q. Then, for each attribute of these d attributes, the aggregator finds the least number of subintervals that can make up its specified interval in q′ from its corresponding 1-D hierarchy. Finally, the aggregator sums up the noisy frequencies of all the d-dim intervals consisting of them to get the answer of q′, which is equivalent to that of q.

HIO solves the first challenge by capturing the correlations among all attributes. However, HIO fails to handle the other two challenges. In HIO, users are divided into (h + 1)^d groups where h = log_b c. When d or c is relatively large, there are too few users in each group, which will incur a high magnitude of added noise in the frequencies of d-dim intervals and result in large errors.

3.4 LHIO: Low-dimensional HIO

We observe that CALM achieves good utility by using 2-D marginals to reconstruct high-dimensional ones. Using this idea, we can modify HIO, resulting in a new approach called Low-dimensional HIO (LHIO). Its main idea is to compute the answers of 2-D range queries and then estimate the answer of a high-dimensional range query from them. Specifically, the aggregator first generates all (d choose 2) attribute pairs from the given d attributes and then randomly divides users into m = (d choose 2) groups, where each group works on one pair of attributes. Next, for each attribute pair, the aggregator invokes HIO to construct a 2-D hierarchy by interacting with its corresponding user group. The constructed 2-D hierarchies can be directly used to answer all possible 2-D range queries. To estimate the answer of a higher dimensional range query, the aggregator invokes the estimation method which will be presented in Section 4.4.

However, directly using the obtained noisy frequencies will lead to two inconsistency problems in our setting. The first one is within a 2-D hierarchy. That is, different levels of the noisy hierarchy may give inconsistent estimations due to LDP noise. The second one is among different 2-D hierarchies. Since each attribute is related to d − 1 pairs, the frequencies marginalized on it from these d − 1 2-D hierarchies are usually different. The accuracy of the answers of the 2-D range queries will increase if this problem can be solved. We identify that the key to removing inconsistency is to solve the first inconsistency problem, since the second one can be easily solved by the overall consistency in CALM after the first one is handled. Therefore, we focus on the first problem and develop a new method to enforce consistency within a 2-D hierarchy. Its main idea is to adapt the constrained inference in Hay et al. [22] to a 2-D hierarchy and perform the operation twice, starting with the first and second attribute of the attribute pair, respectively. Its details are omitted due to space limitation.

LHIO satisfies ε-LDP because the report from each user uses OLH and satisfies ε-LDP. We show that by avoiding directly handling high-dimensional queries and removing inconsistency, LHIO can perform much better than HIO. Similar to CALM, LHIO overcomes the first two challenges by capturing necessary pair-wise attribute correlations. However, LHIO fails to solve the third challenge. In LHIO, users are divided into (d choose 2) · (h + 1)^2 groups, where h = log_b c. For a relatively large c, it will also bring about excessive noises.

3.5 MSW: Multiplied Square Wave

Recently, Li et al. [29] proposed an approach called Square Wave (SW) for estimating the distribution of a single numerical attribute under LDP. It takes advantage of the ordinal nature of the domain and reports values that are close to the true value with higher probabilities than values that are farther away from the true value. For handling an attribute with the discrete domain [c], we initially normalize it to the continuous domain [0, 1]. Given a value v ∈ [0, 1], SW perturbs it as:

    ∀y ∈ [−δ, 1 + δ],  Pr[SW(v) = y] = { p,   if |v − y| ≤ δ,
                                        { p′,  otherwise,

where δ = (εe^ε − e^ε + 1) / (2e^ε(e^ε − 1 − ε)) is the "closeness" threshold. By maximizing the difference between p and p′ while satisfying that the total probability adds up to 1, the values p and p′ can be derived as p = e^ε / (2δe^ε + 1) and p′ = 1 / (2δe^ε + 1), respectively. After receiving perturbed reports from all users, the aggregator runs the Expectation Maximization algorithm to find an estimated distribution that maximizes the expectation of the observed output. It is shown in [29] that SW outperforms other approaches for answering 1-D range queries.

Here we introduce Multiplied Square Wave (MSW), which is extended from SW to handle multi-dimensional range queries under LDP. In MSW, given d attributes, the aggregator randomly divides users into d groups, where each group reports one attribute. After utilizing SW to obtain the distribution of each individual attribute, a multi-dimensional range query is answered by using the product of the answers of all associated 1-D range queries. Such approximation implicitly assumes that all attributes are independent.

In MSW, each user only reports one attribute via SW, which satisfies ε-LDP. Therefore, this process can ensure ε-LDP for each user. In addition, the subsequent multiplication post-processing steps take those outputs that are already differentially private and do not access any user's raw data. Thus, MSW satisfies ε-LDP. Since MSW only collects the information of individual attributes, it solves the last two challenges.
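The SW perturbation just defined can be sketched as a sampling procedure (a minimal simulation with our own variable names; p and p′ are densities on the continuous output domain, and the EM reconstruction step is omitted):

```python
import math
import random

def sw_params(eps):
    """Closeness threshold delta and densities p, p' from Section 3.5."""
    e = math.exp(eps)
    delta = (eps * e - e + 1) / (2 * e * (e - 1 - eps))
    p = e / (2 * delta * e + 1)
    q = 1 / (2 * delta * e + 1)       # p' in the SW definition
    return delta, p, q

def sw_perturb(v, eps, rng):
    """Report y in [-delta, 1 + delta]: within delta of v with density p,
    elsewhere with density p'."""
    delta, p, q = sw_params(eps)
    if rng.random() < 2 * delta * p:  # total mass of the 'close' band
        return v - delta + 2 * delta * rng.random()
    u = rng.random()                  # far region has total length 1
    return -delta + u if u < v else v + delta + (u - v)

rng = random.Random(0)
delta, p, q = sw_params(1.0)
reports = [sw_perturb(0.3, 1.0, rng) for _ in range(1000)]
```

The close band [v − δ, v + δ] has mass 2δp and the remaining support has mass p′ · 1, which together sum to 1, matching the constraint used to derive p and p′.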

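MSW's independence approximation then reduces a multi-dimensional answer to a product of 1-D range answers, which can be sketched as follows (our own toy illustration of the multiplication step, assuming the per-attribute distributions have already been estimated by SW):

```python
def msw_answer(dist_per_attr, query):
    """dist_per_attr: one estimated value distribution per attribute
    (lists of frequencies, 0-based).  query: dict attribute -> (l, r)
    with 0-based inclusive bounds.  Multiplies the 1-D range answers,
    implicitly assuming the attributes are independent."""
    ans = 1.0
    for t, (l, r) in query.items():
        ans *= sum(dist_per_attr[t][l:r + 1])
    return ans

# Two attributes on a domain of size 4, each with a uniform distribution.
dist = [[0.25] * 4, [0.25] * 4]
ans = msw_answer(dist, {0: (0, 1), 1: (0, 3)})
```

For truly independent attributes this product is accurate; for correlated attributes it is exactly where MSW's error comes from.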
However, it fails to handle the first challenge. MSW totally loses the correlations among attributes, which will produce high errors when handling correlated attributes.

4 GRID APPROACHES

In this section, we first elaborate our grid approaches for answering multi-dimensional range queries under LDP in Sections 4.1-4.4. Then we give their privacy and utility analysis in Section 4.5. Finally, we describe how to choose the proper granularities in Section 4.6.

4.1 Overview

As analyzed in Section 3, none of the baseline approaches can overcome all three key challenges. To address this problem, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the 2-D domains of all attribute pairs into 2-D grids that can answer all 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries.

However, since values within the same cell in a grid are reported together, the aggregator cannot tell the distribution within each cell and only assumes a uniform distribution. When computing the answer of a 2-D range query by the cells that are partially included in the query, this may lead to large error due to the uniformity assumption. To correct this deficiency, we further propose an upgraded approach called Hybrid-Dimensional Grids (HDG), which also introduces finer-grained 1-D grids and combines the information from 1-D and 2-D grids to answer range queries.

Note that the first two challenges pose a dilemma: capturing full correlations (as HIO does) will lead to the curse of dimensionality, while only focusing on individual attributes (as MSW does) will totally lose correlation information. In CALM [56], a similar dilemma is solved by using 2-D marginals to reconstruct high-dimensional ones, which achieves a good trade-off when handling these two challenges. Inspired by this idea, both TDG and HDG choose to capture the necessary pair-wise attribute correlations via 2-D grids, which overcomes the first two challenges. The third challenge is also carefully solved in TDG and HDG by properly using binning with the guideline to reduce the error incurred by a large domain.

Specifically, both TDG and HDG consist of three phases:

Phase 1. Constructing Grids. In TDG, from the given d attributes, the aggregator first generates all (d choose 2) attribute pairs. Then the aggregator randomly divides users into m = (d choose 2) groups, each of which corresponds to one pair. Next, for each attribute pair (a_j, a_k) where 1 ≤ j < k ≤ d, the aggregator assigns the same granularity g_2 to construct a 2-D grid G^(j,k) by partitioning the 2-D domain [c] × [c] into g_2 × g_2 2-D cells of equal size. In particular, each 2-D cell specifies a 2-D subdomain consisting of (c/g_2) × (c/g_2) 2-D values. Finally, to obtain noisy frequencies of cells in each grid, the aggregator instructs each user in the group corresponding to the grid to report which cell his/her private value is in using OLH.

In HDG, the aggregator also constructs d 1-D grids for the d attributes, respectively. Thus there will be d + (d choose 2) grids in HDG and users are divided into m = d + (d choose 2) groups, each of which corresponds to one of these grids. In addition to constructing (d choose 2) 2-D grids with granularity g_2 as in TDG, in HDG the aggregator assigns the identical granularity g_1 to construct a 1-D grid G^(j) containing g_1 1-D cells of equal size for each single attribute a_j (1 ≤ j ≤ d). In particular, each 1-D cell specifies a 1-D subdomain consisting of c/g_1 1-D values. Finally, as in TDG, the aggregator uses OLH to obtain noisy frequencies of cells in each grid.

Phase 2. Removing Negativity and Inconsistency. Due to using OLH to ensure privacy, the noisy frequency of a cell may be negative, which violates the prior knowledge that the true one is non-negative. Moreover, since an attribute is related to multiple grids, the noisy frequencies integrated on the attribute in different grids may be different, leading to inconsistency among grids. In this phase, to improve the utility, the aggregator post-processes the constructed grids to remove the negativity and inconsistency. The difference between TDG and HDG is that TDG only requires the aggregator to handle 2-D grids, while 1-D and 2-D grids need to be handled together in HDG. We describe the details of post-processing grids in Section 4.2.

Phase 3. Answering Range Queries. In this phase, the aggregator can answer all multi-dimensional range queries. We first describe how to answer a 2-D range query. For ease of illustration, we take a 2-D range query q_0 interested in A_{q_0} = {a_1, a_2} as an example. In TDG, to get the answer f_{q_0} of q_0, the aggregator first finds the 2-D grid G^(1,2) corresponding to A_{q_0} and then checks all 2-D cells in G^(1,2) in the following manner. If a cell is completely included in q_0, the aggregator includes its noisy frequency in f_{q_0}; if a cell is partially included, the aggregator estimates the sum of frequencies of common 2-D values between the cell and q_0 by uniform guess, i.e., assuming that the frequencies of 2-D values within the cell are uniformly distributed, and then adds the sum to f_{q_0}.

In HDG, the aggregator treats those cells partially included in q_0 using a response matrix rather than uniform guess, which can significantly improve the accuracy of results. To be specific, for each attribute pair (a_j, a_k), the aggregator first employs the three grids {G^(j), G^(k), G^(j,k)} to build a response matrix M^(j,k) before answering 2-D range queries. In particular, the matrix M^(j,k) consists of c × c elements that are in one-to-one correspondence with the estimated frequencies of 2-D values in the 2-D domain [c] × [c] of (a_j, a_k). The details of response matrix generation are given in Section 4.3. When calculating the answer f_{q_0} of the 2-D query q_0 in HDG, the aggregator also checks all 2-D cells in the grid G^(1,2) corresponding to A_{q_0}. For a cell completely included in q_0, the aggregator includes its noisy frequency in f_{q_0} as in TDG; for a cell partially included in the query q_0, the aggregator identifies the common 2-D values between this cell and q_0, and then adds the sum of their corresponding elements in M^(1,2) to f_{q_0}.

For a λ-D range query where λ > 2, its answer cannot be directly obtained from the constructed 2-D grids or response matrices. To answer this λ-D query, we propose to split it into (λ choose 2) associated 2-D range queries and then estimate its answer from all answers of these (λ choose 2) 2-D queries. We discuss it in detail in Section 4.4.

4.2 Post-Process for Grids

The post-process for grids contains two basic steps, the non-negativity step and the consistency step, which are used to remove negativity and inconsistency, respectively.

382 Non-Negativity Step. In this step, the aggregator handles the Algorithm 1 Building Response Matrix estimated frequencies of cells in each grid by Norm-Sub [48], which Input: Grids {G(j),G(k),G(j,k)}, domain size c can make all estimates non-negative and sum up to 1. In Norm-Sub, Output: Response matrix M(j,k) firstly, all negative estimates are converted to 0. Then thetotal ( , ) 1 1: initialize all c × c elements in the matrix M j k as ; difference between 1 and the sum of positive estimates is calculated. c2 2: repeat Next, the average difference is obtained through dividing the total ( ) ( ) ( , ) 3: { j , k , j k } difference by the number of positive estimates. Finally, every posi- for each grid G in G G G do tive estimate is updated by subtracting the average difference. The 4: for each cell s in G do 5: Φ( ) process is repeated until all estimates become non-negative. Find the set of 2-D values s corresponding to s; = (j,k) 6: Calculate Y M [βj, βk ]; Consistency Step. We first describe how to achieve consistency , Φ (βj βkÍ)∈ (s) on an attribute among grids. For an attribute a, it is related to d 7: if Y , 0 then − Φ grids in total, which includes one 1-D grid and d 1 2-D grids. 8: for each 2-D value (βj, βk ) in (s) do Assume these d grids are {G ,G , ···G }. For an integer j ∈ [1,д ], (j,k) , 1 2 d 2 (j,k) , M [βj βk ] 9: M [βj βk ] ← Y · fs ; we define PGi (a, j) to be the sum of frequencies of Gi ’s cells whose c + c 10: until convergence specified subdomain corresponds to a is in [(j −1)× д 1, j × д ]. To 2 2 ( , ) 11: return M j k make all PGi (a, j) consistent, we compute their weighted average d as P(a, j) = θi · PG (a, j), where θi is the weight of PG (a, j). = i i i 1 Algorithm 1 provides the details of building response matrix To get a betterÍ estimation, we need to carefully set the value of (j,k) ( , ) { (j), (k), (j,k)} θ . Our goal is to minimize the variance of P(a, j), i.e. Var [P(a, j)] = M for attribute pair aj ak . 
To achieve consistency among all attributes, we can use the above method one by one for each single attribute. It is shown in [34] that, following any order of these attributes, a later consistency step will not invalidate the consistency established in previous steps.

Note that applying the consistency step may incur negativity, and vice versa. Thus, in the post-process, we interchangeably invoke these two steps multiple times. Since we need to ensure non-negativity for the response matrix generation in Phase 3, we end the post-process with the non-negativity step. While this last step may again introduce inconsistency, it tends to be very small.

4.3 Response Matrix Generation

For an attribute pair (aj, ak), it corresponds to the response matrix M(j,k) of size c × c, where the element M(j,k)[βj, βk] represents the estimated frequency of the 2-D value (βj, βk) in the [c] × [c] 2-D domain of (aj, ak). To build M(j,k), we propose to invoke the efficient estimation method Weighted Update [2, 20] with the three grids {G(j), G(k), G(j,k)} corresponding to {aj, ak, (aj, ak)}, respectively. Its main idea is to keep using the information on each cell in these three grids to update the matrix until each cell's frequency equals the sum of its corresponding elements in the matrix.

Algorithm 1 provides the details of building the response matrix M(j,k) for an attribute pair (aj, ak). It takes the grids {G(j), G(k), G(j,k)} and the domain size c as inputs and outputs the response matrix M(j,k). In Algorithm 1, for each grid G in {G(j), G(k), G(j,k)}, the aggregator performs the following update process on M(j,k). For each cell s in G, the aggregator first finds the set of 2-D values Φ(s) corresponding to s, which means that Φ(s) consists of all those 2-D values whose frequency can contribute to the frequency fs of cell s. To illustrate the definition of Φ(s), we take a 2-D cell s in G(j,k) as an example. Assume the 2-D cell s specifies a 2-D subdomain [ls^j, rs^j] × [ls^k, rs^k], where [ls^j, rs^j] and [ls^k, rs^k] correspond to aj and ak, respectively. Then, Φ(s) can be represented as

Φ(s) = {(βj, βk) | βj ∈ [ls^j, rs^j], βk ∈ [ls^k, rs^k]}.

Note that this representation is also applicable to a 1-D cell s in G(j) (or G(k)), since we can equivalently transform its specified 1-D subdomain [ls^j, rs^j] (or [ls^k, rs^k]) into the 2-D subdomain [ls^j, rs^j] × [1, c] (or [1, c] × [ls^k, rs^k]).
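Algorithm 1's update loop can be sketched as follows (illustrative only, not the released implementation [53]; each cell is passed in flattened form as the pair (Φ(s), fs), and a fixed iteration count stands in for the convergence test):

```python
import numpy as np

def build_response_matrix(cells, c, iters=100):
    """Weighted Update over a c x c response matrix (sketch).

    `cells` is a list of (phi, f_s) pairs, one per cell across the three
    grids, where phi is a list of 0-based (beta_j, beta_k) index pairs
    covered by the cell and f_s is its post-processed frequency.
    """
    M = np.full((c, c), 1.0 / (c * c))          # uniform initialization
    for _ in range(iters):                      # stand-in for convergence test
        for phi, f_s in cells:
            rows = [b_j for b_j, _ in phi]
            cols = [b_k for _, b_k in phi]
            Y = M[rows, cols].sum()             # current mass covered by the cell
            if Y != 0:
                M[rows, cols] *= f_s / Y        # rescale covered elements to f_s
    return M
```

As a tiny check, with c = 2 and two 1-D cells covering the two rows with frequencies 0.7 and 0.3, the matrix's row sums converge to exactly those frequencies.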
With Φ(s), the aggregator updates the elements in M(j,k) as in Lines 6-9 of Algorithm 1. This update process is repeated until convergence.

In Algorithm 1, the convergence criterion is that the sum of the changes of all elements in the response matrix after each update process is lower than a given threshold. By comparing the results of setting different thresholds, we found that the results are almost the same so long as the threshold is smaller than 1/n.

4.4 Estimation for λ-D Range Query

To estimate the answer fq of a λ-D range query

q = (at1, [lt1, rt1]) ∧ (at2, [lt2, rt2]) ∧ ··· ∧ (atλ, [ltλ, rtλ]),

where Aq = {atϕ | 1 ≤ ϕ ≤ λ}, the aggregator first splits q into the C(λ,2) associated 2-D range queries

q(j,k) = {(aj, [lj, rj]) ∧ (ak, [lk, rk]) | aj, ak ∈ Aq},

and then gets their answers {fq(j,k) | aj, ak ∈ Aq} as described in Section 4.1. Finally, the aggregator uses these C(λ,2) 2-D queries' answers to estimate fq.

In general, such an estimation problem can be solved by Maximum Entropy Optimization [34, 56]. Due to space limitation, we refer the readers to our full version [52] for its description. However, in experiments, we observe that Maximum Entropy Optimization cannot converge quickly in some cases. Therefore, we propose to use Weighted Update [2, 20] to solve this estimation problem, which can achieve almost the same accuracy with higher efficiency.

Algorithm 2 Estimating Answer of λ-D Range Query
Input: associated 2-D queries' answers {fq(j,k) | aj, ak ∈ Aq}
Output: estimated answer vector z
1: initialize all 2^λ elements in the vector z as 1/2^λ;
2: repeat
3:   for each fq(j,k) in {fq(j,k) | aj, ak ∈ Aq} do
4:     Find the set of queries Q(q)(j,k) corresponding to q(j,k);
5:     Calculate Y = Σ_{q′∈Q(q)(j,k)} z[q′];
6:     if Y ≠ 0 then
7:       for each query q′ in Q(q)(j,k) do
8:         z[q′] ← (z[q′] / Y) · fq(j,k);
9: until convergence
10: return z

Algorithm 2 gives the procedure of estimating the answer of a λ-D range query q. It takes the answers of the associated C(λ,2) 2-D queries as inputs and outputs an estimated answer vector z. In particular, the vector z consists of 2^λ elements that are in one-to-one correspondence with the answers of the 2^λ λ-D queries in

Q(q) = {∧_t (at, [lt, rt] or [lt, rt]ᶜ) | at ∈ Aq},

where the interval [lt, rt]ᶜ is the complement of [lt, rt] on the domain of at. In Algorithm 2, for each fq(j,k) in {fq(j,k) | aj, ak ∈ Aq}, the aggregator performs the following update process on z. The aggregator first finds the set of λ-D queries Q(q)(j,k) corresponding to the 2-D query q(j,k), which means that Q(q)(j,k) consists of all those λ-D queries whose answer can contribute to fq(j,k). In particular, Q(q)(j,k) contains 2^(λ−2) λ-D queries from Q(q) and is defined as {∧_t (at, [lt, rt] or [lt, rt]ᶜ) ∧ q(j,k) | at ∈ Aq \ {aj, ak}}. Then, the aggregator calculates the sum Y of z[q′] for all q′ ∈ Q(q)(j,k), where z[q′] is the element corresponding to the answer of q′. Next, the aggregator uses fq(j,k) to update the elements in z as in Lines 6-8. This process is repeated until convergence. The estimated answer fq of the λ-D query q equals its corresponding element in z, i.e., z[q].

In Algorithm 2, the convergence criterion is that the sum of the changes of all elements in the estimated vector after each update process is lower than a given threshold. We also found that the results are almost the same so long as the threshold is smaller than 1/n.
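A compact sketch of this procedure (again illustrative): each of the 2^λ elements of z is indexed by a bitmask whose bit t indicates whether attribute at takes its original interval (1) or the complement (0); the set Q(q)(j,k) then consists of the masks whose bits j and k are both 1, and a fixed iteration count again stands in for the convergence test.

```python
from itertools import combinations

def estimate_lambda_d(lam, answers_2d, iters=100):
    """Weighted Update for a λ-D query from its 2-D answers (sketch).

    Element z[mask] corresponds to the λ-D query in which attribute t
    takes its original interval if bit t of `mask` is 1, and the
    complement otherwise; `answers_2d[(j, k)]` is the answer of the
    associated 2-D query on the attribute pair (j, k).
    """
    z = [1.0 / 2**lam] * (2**lam)
    for _ in range(iters):                      # stand-in for convergence test
        for (j, k), f_jk in answers_2d.items():
            # Q(q)^(j,k): masks keeping the original intervals on both j and k
            group = [m for m in range(2**lam) if (m >> j) & 1 and (m >> k) & 1]
            Y = sum(z[m] for m in group)
            if Y != 0:
                for m in group:
                    z[m] *= f_jk / Y            # rescale the group's mass to f_jk
    return z

# Example: λ = 3 attributes, all pairwise 2-D answers equal to 0.25.
answers = {pair: 0.25 for pair in combinations(range(3), 2)}
z = estimate_lambda_d(3, answers)
estimate = z[0b111]   # the λ-D query where every attribute keeps [l_t, r_t]
```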
We ε-LDP because all the information from each user to the aggre- observe that the second part is a constant which is much smaller gator goes through OLH with ε as privacy budget, and no other m ¯ than the first part. Ignoring the small factor n · fv in the first part, information is leaked. the expected squared noise and sampling error can be dominated 4me ε Error Analysis. Below we analyze the expected squared error by n(e ε −1)2 . Due to space limitation, we refer the readers to our between the true query answer and the estimated answer. There full version [52] for the detailed derivation of the above equations. are four kinds of errors: noise error, sampling error, non-uniformity Non-Uniformity Error. Non-uniformity error is caused by cells error, and estimation error. that intersect with the query rectangle, but are not contained in it. Noise and Sampling Error. The noise error is due to the use of For these cells, we need to estimate how many data points are in LDP frequency oracles. To satisfy LDP, one adds, to each cell, an the intersected cells assuming that the data points are uniformly independently generated noise, and these noises have the same distributed, which will lead to non-uniformity error when the data standard deviation. When summing up the noisy frequencies of points are not uniformly distributed. The magnitude of this error cells to answer a query, the noise error is the sum of the correspond- in any intersected cell, in general, depends on the number of data ing noises. As these noises are independently generated zero-mean points in that cell, and is bounded by it. Therefore, the finer the par- random variables, they cancel each other out to a certain degree. tition granularity, the lower the non-uniformity error. Calculating In fact, because these noises are independently generated, the vari- precise non-uniformity error requires the availability of the true ance of their sum equals the sum of their variances. 
Therefore, the data distribution, which is not the case in our setting. Thus we opt finer granularity one partitions the domain into, the more cellsare to compute the approximate non-uniformity error. included in a query, and the larger the noise error is. The sampling Estimation Error. When estimating the answer of a λ-D range error is incurred by using cells’ frequencies obtained from a user query where λ > 2 from the associated answers of 2-D range
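To make the dominant noise-and-sampling term concrete, it can be evaluated numerically (the parameter values below are illustrative only):

```python
import math

def dominant_noise_sampling_error(eps, m, n):
    """Dominant expected squared noise-and-sampling error, 4·m·e^ε / (n·(e^ε − 1)²)."""
    return 4 * m * math.exp(eps) / (n * (math.exp(eps) - 1) ** 2)

# With ε = 1, n = 1e6 users and m = 10 groups, the per-cell squared
# error is on the order of 1e-5; shrinking ε toward 0 grows it
# roughly as 1/ε².
print(dominant_noise_sampling_error(1.0, 10, 1e6))
```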

Since the estimation error is dataset dependent, there is no formula for estimating it. In general, more accurate answers of 2-D range queries result in a smaller estimation error. However, because its magnitude depends on the dataset, some uncertainty is introduced, which means that an opposite result may appear in a few cases.

4.6 Choosing Granularities

Since the granularities g1, g2 directly affect the utility of our grid approaches, we propose the following guideline for properly choosing them.

Guideline: To minimize the sum of the squared noise and sampling error and the squared non-uniformity error, the granularity g1 for 1-D grids should be g1 = (n1·(e^ε−1)²·α1² / (2m1e^ε))^(1/3); the granularity g2 for 2-D grids should be g2 = (2α2·(e^ε−1)·(n2/(m2e^ε))^(1/2))^(1/2), where ε is the total privacy budget, ni (i = 1, 2) is the number of users used for i-D grids, mi (i = 1, 2) is the number of user groups for i-D grids, and {α1, α2} are some small constants depending on the dataset.

For simplicity, we make each user group have the same population, i.e., n2/m2 = n/C(d,2) for TDG and n1/m1 = n2/m2 = n/(d + C(d,2)) for HDG. To ensure that g1 and g2 both divide the domain size c, for each of them we take the power of two closest to its derived value as the final value. If the obtained granularity is larger than c, we set it to c by default. Our experimental results suggest that setting α1 = 0.7 and α2 = 0.03 typically achieves good performance across different datasets.

Analysis on g1. A range query on a 1-D grid specifies a query interval on the attribute corresponding to the grid. For an average case, we consider the ratio of this interval to the attribute's domain size to be 1/2. When answering the query from a 1-D grid with granularity g1, there are roughly g1/2 cells included in this query. The squared noise and sampling error is (g1/2) · 4m1e^ε/(n1(e^ε−1)²) = 2g1m1e^ε/(n1(e^ε−1)²). The non-uniformity error is proportional to the sum of frequencies of values in the cells that intersect with the two sides of the query interval. Assuming that the non-uniformity error is α1/g1 for some constant α1, it contributes a squared error of (α1/g1)². To minimize the sum of the two squared errors 2g1m1e^ε/(n1(e^ε−1)²) + (α1/g1)², we should set g1 to (n1·(e^ε−1)²·α1² / (2m1e^ε))^(1/3).

Analysis on g2. Here we extend the above analysis to the 2-D grid setting. For a 2-D query, we assume that the ratio of each query interval to its corresponding attribute's domain size is 1/2. Then the squared noise and sampling error is (g2/2)² · 4m2e^ε/(n2(e^ε−1)²) = (g2)²·m2e^ε/(n2(e^ε−1)²). The non-uniformity error is proportional to the sum of the frequencies of values in the cells that fall on the four edges of the query rectangle. The query rectangle's edges contain 4·(g2/2) = 2g2 cells, and the expected sum of frequencies of values included in these cells is 2g2 · 1/(g2×g2) = 2/g2. Similar to the 1-D grid setting, we assume that the non-uniformity error on average is some portion of it, so the squared error from non-uniformity is (2α2/g2)² for some constant α2. Our goal is to select g2 to minimize the sum of the two squared errors (g2)²·m2e^ε/(n2(e^ε−1)²) + (2α2/g2)². To achieve this goal, g2 should be (2α2·(e^ε−1)·(n2/(m2e^ε))^(1/2))^(1/2).

Discussion. In the analysis of the non-uniformity error, for a cell that contributes to this error, we calculate the expected sum of frequencies of values in this cell based on the uniformity assumption. Although this assumption may lead to a deviation between the calculated error and the true one, it makes the analysis more general for diverse datasets. Moreover, since 1-D grids are finer-grained, this deviation's influence on the performance of HDG tends to be negligible. Thus, such an assumption still keeps our guideline consistently effective for HDG when handling diverse datasets. Note that the recommended values of {α1, α2} are obtained by tuning them on synthetic datasets under different settings of n, c, d, which does not leak any real users' private information. Besides, all other necessary parameters for choosing granularities are derived from public background knowledge and do not require the aggregator to access raw data. Therefore, configuring TDG and HDG with our guideline will not lead to any privacy leakage.
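The guideline can be turned into a small helper (a sketch under the HDG grouping assumption, so ni/mi = n/(d + C(d,2)) for both grid kinds; the power-of-two rounding and the cap at c follow the description above):

```python
import math

def choose_granularities(eps, n, d, c, alpha1=0.7, alpha2=0.03):
    """Granularities g1, g2 per the guideline, rounded to powers of two.

    Assumes the HDG grouping: m = d + C(d, 2) equally sized user groups,
    so n_i / m_i = n / m for both 1-D and 2-D grids.
    """
    m = d + d * (d - 1) // 2            # d 1-D grids plus C(d, 2) 2-D grids
    ratio = n / m                       # users per group, n_i / m_i
    e = math.exp(eps)
    g1 = (ratio * (e - 1) ** 2 * alpha1 ** 2 / (2 * e)) ** (1 / 3)
    g2 = math.sqrt(2 * alpha2 * (e - 1) * math.sqrt(ratio / e))

    def round_pow2(g):                  # closest power of two, capped at c
        p = 2 ** round(math.log2(g)) if g > 1 else 1
        return min(p, c)

    return round_pow2(g1), round_pow2(g2)
```

Under these illustrative settings (ε = 1, n = 10^6, d = 5, c = 256), the helper yields g1 = 32 and g2 = 4.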
5 EXPERIMENTAL EVALUATION

In this section, we aim to answer the following questions: (1) how does our proposed HDG perform, (2) how do different parameters affect the results, and (3) how effective is the guidance for choosing granularities given by our guideline.

5.1 Setup

Datasets. We make use of two real datasets and two synthetic datasets in our experiments.

• Ipums [38]: It is from the Integrated Public Use Microdata Series and has around 1 million records of the United States census in 2018.
• Bfive [25]: It is collected through an interactive on-line personality test and contains around 1 million records. Each record describes the time spent on each question in milliseconds.
• Normal: This dataset is synthesized from a multivariate normal distribution with mean 0 and standard deviation 1. The covariance between every two attributes is 0.8.
• Laplace: This dataset is synthesized from a multivariate Laplace distribution with mean 0 and standard deviation 1. The covariance between every two attributes is 0.8.

For the two real datasets, we sample 1 million user records. To experiment with different numbers of users, we generate multiple test datasets from the two synthetic datasets with the number of users ranging from 100k to 10M. For evaluation varying the number of attributes and domain sizes, we generate multiple versions of these four datasets with the number of attributes ranging from 3 to 10 and the domain sizes ranging from 2^4 to 2^10.

Competitors. We compare HDG against TDG and the baseline approaches, including HIO, CALM, MSW, and LHIO. In addition, we add a benchmark approach Uni which always outputs a uniform guess. In particular, we set the branch factor b = 4 for HIO and LHIO.

REFERENCES
[1] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. 2019. Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication. In AISTATS, Vol. 89. PMLR, 1120–1129.
[2] Sanjeev Arora, Elad Hazan, and Satyen Kale. 2012. The Multiplicative Weights Update Method: a Meta-Algorithm and Applications. Theory Comput. 8, 1 (2012), 121–164.
[3] Raef Bassily. 2019. Linear Queries Estimation with Local Differential Privacy. In AISTATS (Proceedings of Machine Learning Research), Vol. 89. PMLR, 721–729.
[4] Raef Bassily, Kobbi Nissim, Uri Stemmer, and Abhradeep Guha Thakurta. 2017. Practical Locally Private Heavy Hitters. In NIPS. 2288–2296.
[5] Raef Bassily and Adam D. Smith. 2015. Local, Private, Efficient Protocols for Succinct Histograms. In STOC. ACM, 127–135.
[6] Mark Bun, Jelani Nelson, and Uri Stemmer. 2018. Heavy Hitters and the Structure of Local Privacy. In PODS. ACM, 435–447.
[7] Rui Chen, Haoran Li, A. Kai Qin, Shiva Prasad Kasiviswanathan, and Hongxia Jin. 2016. Private spatial data aggregation in the local setting. In ICDE. IEEE Computer Society, 289–300.
[8] Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. 2018. Marginal Release Under Local Differential Privacy. In SIGMOD. ACM, 131–146.
[9] Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. 2019. Answering Range Queries Under Local Differential Privacy. PVLDB 12, 10 (2019), 1126–1138.
[10] Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. 2012. Differentially Private Spatial Decompositions. In ICDE. IEEE Computer Society, 20–31.
[11] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting Telemetry Data Privately. In NIPS. 3571–3580.
[12] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Local Privacy and Statistical Minimax Rates. In FOCS. IEEE Computer Society, 429–438.
[13] John C. Duchi, Martin J. Wainwright, and Michael I. Jordan. 2013. Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation. In NIPS. 1529–1537.
[14] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC (Lecture Notes in Computer Science), Vol. 3876. Springer, 265–284.
[15] Alexander Edmonds, Aleksandar Nikolov, and Jonathan Ullman. 2020. The power of factorization mechanisms in local and central differential privacy. In STOC. ACM, 425–438.
[16] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In CCS. ACM, 1054–1067.
[17] Xiaolan Gu, Ming Li, Yang Cao, and Li Xiong. 2019. Supporting Both Range Queries and Frequency Estimation with Local Differential Privacy. In CNS. IEEE, 124–132.
[18] Xiaolan Gu, Ming Li, Yueqiang Cheng, Li Xiong, and Yang Cao. 2020. PCKV: Locally Differentially Private Correlated Key-Value Data Collection with Optimized Utility. In USENIX Security. USENIX Association, 967–984.
[19] Xiaolan Gu, Ming Li, Li Xiong, and Yang Cao. 2020. Providing Input-Discriminative Protection for Local Differential Privacy. In ICDE. IEEE, 505–516.
[20] Moritz Hardt, Katrina Ligett, and Frank McSherry. 2012. A Simple and Practical Algorithm for Differentially Private Data Release. In NIPS. 2348–2356.
[21] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan Zhang. 2016. Principled Evaluation of Differentially Private Algorithms using DPBench. In SIGMOD. ACM, 139–154.
[22] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. 2010. Boosting the Accuracy of Differentially Private Histograms Through Consistency. PVLDB 3, 1 (2010), 1021–1032.
[23] Justin Hsu, Sanjeev Khanna, and Aaron Roth. 2012. Distributed Private Heavy Hitters. In ICALP, Vol. 7391. Springer, 461–472.
[24] Matthew Joseph, Aaron Roth, Jonathan Ullman, and Bo Waggoner. 2018. Local Differential Privacy for Evolving Data. In NIPS. 2381–2390.
[25] Kaggle. [n.d.]. Big Five Personality Test. http://www.kaggle.com/tunguz/big-five-personality-test/data.
[26] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. 2008. What Can We Learn Privately?. In FOCS. IEEE Computer Society, 531–540.
[27] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. 2014. A Data- and Workload-Aware Query Answering Algorithm for Range Queries Under Differential Privacy. PVLDB 7, 5 (2014), 341–352.
[28] Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor. 2010. Optimizing linear counting queries under differential privacy. In PODS. ACM, 123–134.
[29] Zitao Li, Tianhao Wang, Milan Lopuhaä-Zwakenberg, Ninghui Li, and Boris Skoric. 2020. Estimating Numerical Distributions under Local Differential Privacy. In SIGMOD. ACM, 621–635.
[30] Ryan McKenna, Raj Kumar Maity, Arya Mazumdar, and Gerome Miklau. 2020. A workload-adaptive mechanism for linear queries under local differential privacy. PVLDB 13, 11 (2020), 1905–1918.
[31] Ryan McKenna, Gerome Miklau, Michael Hay, and Ashwin Machanavajjhala. 2018. Optimizing error of high-dimensional statistical queries under differential privacy. PVLDB 11, 10 (2018), 1206–1219.
[32] Wahbeh H. Qardaji, Weining Yang, and Ninghui Li. 2013. Differentially private grids for geospatial data. In ICDE. IEEE Computer Society, 757–768.
[33] Wahbeh H. Qardaji, Weining Yang, and Ninghui Li. 2013. Understanding Hierarchical Methods for Differentially Private Histograms. PVLDB 6, 14 (2013), 1954–1965.
[34] Wahbeh H. Qardaji, Weining Yang, and Ninghui Li. 2014. PriView: practical differentially private release of marginal contingency tables. In SIGMOD. ACM, 1435–1446.
[35] Zhan Qin, Yin Yang, Ting Yu, Issa Khalil, Xiaokui Xiao, and Kui Ren. 2016. Heavy Hitter Estimation over Set-Valued Data with Local Differential Privacy. In CCS. ACM, 192–203.
[36] Zhan Qin, Ting Yu, Yin Yang, Issa Khalil, Xiaokui Xiao, and Kui Ren. 2017. Generating Synthetic Decentralized Social Graphs with Local Differential Privacy. In CCS. ACM, 425–438.
[37] Xuebin Ren, Chia-Mu Yu, Weiren Yu, Shusen Yang, Xinyu Yang, Julie A. McCann, and Philip S. Yu. 2018. LoPub: High-Dimensional Crowdsourced Data Publication With Local Differential Privacy. TIFS 13, 9 (2018), 2151–2166.
[38] Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and Matthew Sobek. 2010. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database].
[39] Haipei Sun, Xiaokui Xiao, Issa Khalil, Yin Yang, Zhan Qin, Wendy Hui Wang, and Ting Yu. 2019. Analyzing Subgraph Statistics from Extended Local Views with Decentralized Differential Privacy. In CCS. ACM, 703–717.
[40] Apple Differential Privacy Team. 2017. Learning with Privacy at Scale, available at http://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf.
[41] Ning Wang, Xiaokui Xiao, Yin Yang, Ta Duy Hoang, Hyejin Shin, Junbum Shin, and Ge Yu. 2018. PrivTrie: Effective Frequent Term Discovery under Local Differential Privacy. In ICDE. IEEE Computer Society, 821–832.
[42] Ning Wang, Xiaokui Xiao, Yin Yang, Jun Zhao, Siu Cheung Hui, Hyejin Shin, Junbum Shin, and Ge Yu. 2019. Collecting and Analyzing Multidimensional Data with Local Differential Privacy. In ICDE. IEEE, 638–649.
[43] Shaowei Wang, Liusheng Huang, Yiwen Nie, Pengzhan Wang, Hongli Xu, and Wei Yang. 2018. PrivSet: Set-Valued Data Analyses with Local Differential Privacy. In INFOCOM. IEEE, 1088–1096.
[44] Shaowei Wang, Yuqiu Qian, Jiachun Du, Wei Yang, Liusheng Huang, and Hongli Xu. 2020. Set-valued Data Publication with Local Privacy: Tight Error Bounds and Efficient Mechanisms. PVLDB 13, 8 (2020), 1234–1247.
[45] Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. 2017. Locally Differentially Private Protocols for Frequency Estimation. In USENIX Security. USENIX Association, 729–745.
[46] Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, and Somesh Jha. 2019. Answering Multi-Dimensional Analytical Queries under Local Differential Privacy. In SIGMOD. ACM, 159–176.
[47] Tianhao Wang, Ninghui Li, and Somesh Jha. 2018. Locally Differentially Private Frequent Itemset Mining. In SP. IEEE Computer Society, 127–143.
[48] Tianhao Wang, Milan Lopuhaä-Zwakenberg, Zitao Li, Boris Skoric, and Ninghui Li. 2020. Locally Differentially Private Frequency Estimation with Consistency. In NDSS. The Internet Society.
[49] Stanley L. Warner. 1965. Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. J. Amer. Statist. Assoc. 60, 309 (1965), 63–69.
[50] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. 2010. Differential privacy via wavelet transforms. In ICDE. IEEE Computer Society, 225–236.
[51] Jianyu Yang, Xiang Cheng, Sen Su, Rui Chen, Qiyu Ren, and Yuhan Liu. 2019. Collecting Preference Rankings Under Local Differential Privacy. In ICDE. IEEE, 1598–1601.
[52] Jianyu Yang, Tianhao Wang, Ninghui Li, Xiang Cheng, and Sen Su. 2020. Answering Multi-Dimensional Range Queries under Local Differential Privacy. CoRR abs/2009.06538 (2020). https://arxiv.org/abs/2009.06538
[53] Jianyu Yang, Tianhao Wang, Ninghui Li, Xiang Cheng, and Sen Su. 2020. Source Code of Approaches. [Online]. http://github.com/YangJianyu-bupt/privmdr.
[54] Min Ye and Alexander Barg. 2018. Optimal Schemes for Discrete Distribution Estimation Under Locally Differential Privacy. IEEE Trans. Inf. Theory 64, 8 (2018), 5662–5676.
[55] Qingqing Ye, Haibo Hu, Xiaofeng Meng, and Huadi Zheng. 2019. PrivKV: Key-Value Data Collection with Local Differential Privacy. In SP. IEEE, 317–331.
[56] Zhikun Zhang, Tianhao Wang, Ninghui Li, Shibo He, and Jiming Chen. 2018. CALM: Consistent Adaptive Local Marginal for Marginal Release under Local Differential Privacy. In CCS. ACM, 212–229.
