Building a Collaborative Phone Blacklisting System with Local Differential Privacy

Daniele Ucci (Department of Computer, Control, and Management Engineering "Antonio Ruberti", "La Sapienza" University of Rome, Rome, Italy)
Roberto Perdisci (University of Georgia, Atlanta, Georgia)
Jaewoo Lee (University of Georgia, Atlanta, Georgia)
Mustaque Ahamad (Georgia Institute of Technology, Atlanta, Georgia)

arXiv:2006.09287v1 [cs.CR] 16 Jun 2020

ABSTRACT

Spam phone calls have been rapidly growing from nuisance to an increasingly effective scam delivery tool. To counter this increasingly successful attack vector, a number of commercial smartphone apps that promise to block spam phone calls have appeared on app stores, and are now used by hundreds of thousands or even millions of users. However, following a business model similar to some online social network services, these apps often collect call records or other potentially sensitive information from users' phones with little or no formal privacy guarantees.

In this paper, we study whether it is possible to build a practical collaborative phone blacklisting system that makes use of local differential privacy (LDP) mechanisms to provide clear privacy guarantees. We analyze the challenges and trade-offs related to using LDP, evaluate our LDP-based system on real-world user-reported call records collected by the FTC, and show that it is possible to learn a phone blacklist using a reasonable overall privacy budget and at the same time preserve users' privacy while maintaining utility for the learned blacklist.

KEYWORDS

Phone Spam, Collaborative Blacklisting, Local Differential Privacy

1 INTRODUCTION

Spam phone calls have been rapidly growing from nuisance to supporting well-coordinated fraudulent campaigns [4, 15, 18]. To counter this increasingly successful attack vector, federal agencies such as the US Federal Trade Commission (FTC) have been working with telephone carriers to design systems for blocking robocalls (i.e., automated calls) [12, 13]. At the same time, a number of smartphone apps that promise to block spam phone calls have appeared on app stores [25, 26, 31], and smartphone vendors, such as Google [16], are embedding some spam blocking functionalities into their default phone apps.

Currently, most spam blocking apps rely on caller ID blacklisting, whereby calls from phone numbers that are known to have been involved in spamming or numerous unwanted calls are blocked (either automatically, or upon explicit user consent). Recently, Pandit et al. [23] have studied how to learn such blacklists from a variety of data sources, including user complaints, phone honeypot call detail records (CDRs) and call recordings. Existing commercial apps, such as YouMail [31] and TouchPal [21], mostly base their blocking approach on user complaints. Other popular apps, such as TrueCaller [26], also use information collected from users' contact lists to distinguish between possible legitimate and unknown/unwanted calls¹. However, in the recent past TrueCaller has experienced significant backlash due to privacy concerns related to the sharing of users' contact lists with a centralized service. Google recently implemented a built-in feature in Android phones to protect against possible spam calls. Nonetheless, Android phones may send information about received calls to Google without strong privacy guarantees [16].

While learning a blacklist from CDRs collected by phone honeypots [23] is a promising approach that poses little or no privacy risks, it suffers from some drawbacks. First, operating a phone honeypot is expensive, as thousands of phone numbers have to be acquired from telephone carriers. Furthermore, in [23] it has been reported that the spam calls targeting the honeypot were skewed towards business-oriented campaigns, likely because the honeypot numbers were mostly re-purposed business numbers (perhaps because re-purposing a user's number may pose some privacy risks, since others might still try to reach a specific person at that number). Conversely, leveraging user complaints also has some drawbacks. For instance, for a user to be able to complain about or label a number (as in TouchPal [21]), the user has to answer the call and identify its purpose. However, only a fraction of users typically answers and listens to calls from unknown numbers (i.e., numbers not registered in the contact list). Furthermore, user-provided call labels are quite noisy, and a relatively high number of complaints about the same number needs to be observed before the source of the calls can be accurately labeled [21]. This may delay the insertion of a spam number into the blacklist, thus leaving open a time window for the spammers to succeed in their campaigns.

One possible solution would be to use an approach similar to the CDR-based blacklisting proposed in [23], while using real phone numbers as "live honeypots." In other words, if a smartphone app could leverage the call logs of real phone users without requiring the users to explicitly label the phone calls, this would provide a solution

¹ These behaviors are inferred merely from publicly available information; further details on the inner workings of commercial apps are difficult to obtain and their technical approach cannot be fully evaluated for comparison.

may be a cancer patient, given that she has received calls from a

caller IDs cancer treatment clinic). At the same time, while the users’ privacy is protected, the server is able to identify heavy hitter caller IDs that are user 1 highly likely associated with new spamming activities. Hence, our system preserves user privacy by making it difficult for the server to server blacklist caller IDs learn the list of caller IDs that are contacting the users, while keeping its capability of building a blacklist of possible spammers. user 2 While LDP mechanisms provide strong privacy guarantees, they are often studied in theoretical terms and their applicability to practi- caller IDs cal, real-world security problems is often left as a secondary consider- user N ation. On the contrary, in this paper we focus primarily on adapting a state-of-the-art LDP mechanism for heavy hitter detection [3] Figure 1: System overview. Caller IDs are collected with local differential to make it practical, so that it can be used in the smartphone app privacy. After learning, blacklist updates are propagated back to users. described above to report the list of caller IDs to the server. Further- more, we evaluate the ability of the server to accurately reconstruct the (noisy) reported caller IDs under different privacy budgets, and to the drawbacks mentioned above. Unfortunately, this may obvi- evaluate the utility of the learned blacklist. To this end, we implement ously pose serious privacy risks to users. For instance, knowing that both the client-side (i.e., smartphone side) and server-side (i.e., black- a user received a phone call from a specific phone number related to list learning side) LDP protocol, leaving other app implementation a cancer treatment clinic may reveal that the user (or a close family details (e.g., user preferences and controls) to future work. member) is a cancer patient. In summary, we make the following contributions: Research Question: Can these privacy concerns be mitigated, and the users’ call logs be collaboratively contributed to enable learning an • We explore how to build a privacy-preserving collaborative accurate phone blacklist with strong privacy guarantees? phone blacklisting system using local differential privacy To answer the above research question, in this paper we study (LDP). Specifically, we expose what are the challenges related whether it is possible to design a practical phone blacklisting system to building a practical LDP-based system that is able to learn that leverages differential privacy7 [ ] mechanisms to collaboratively a phone blacklist from caller ID data provided by a pool of learn effective anti-spam phone blacklists while providing strong contributing users, and propose a number of approaches to privacy guarantees. Specifically, we leverage a state-of-the art local overcome these challenges. To the best of our knowledge, our differential privacy (LDP) mechanisms for generic heavy-hitter de- system is the first application of LDP protocols to building a tection that has been shown to work only in theory [3], and focus defense against phone spam. on adapting it to enable the implementation of a concrete privacy- • We implement our blacklisting system using a new LDP pro- preserving collaborative blacklist learning system that could be tocol for heavy hitter detection. Our protocol is built upon a deployed on real smartphone devices. To the best of our knowledge, state-of-the-art protocol previously proposed in [3]. 
We first we are the first to study the application of local differential privacy to show that [3] is not practical, in that it cannot be applied as is building blacklist-based defenses, and specifically towards defenses to collaborative phone blacklisting. We then introduce novel against telephony spam. LDP protocol modifications, such as data bucketization and Figure 1 shows an overview of our system. Participating users in- variance-reduction mechanisms, to enable heavy hitter de- stall an app that can implement the following high-level functionali- tection by building a LDP-based phone blacklisting approach ties(moredetailsabouttheclientappareprovidedinSection5):when that could be deployed on real smartphones. the user receives a call, the app will first check if the caller ID (i.e., the • We evaluate our LDP-based system on real-world user re- calling number) is in the users’ contacts list; if yes, the caller ID is con- ported call records collected by the FTC. Specifically, we an- sidered as trusted and ignored, otherwise the caller ID is considered alyze multiple different trade-offs, including the trade-off to be unknown and buffered for reporting. Unknown caller IDs are between the privacy budget assigned to the different compo- then checked against a blacklist; if a match is found, the user can be nents of our LDP protocol and the overall blacklist learning alerted that the incoming phone call originates from a phone number accuracy. Our results indicate that it is possible to learn a known to have been involved in spamming activities, so that the user phone blacklist using a reasonable overall privacy budget, can decide whether to reject the call. At the end of a predefined time and to preserve users’ privacy while maintaining utility for window(e.g.,oncedaily),theappwillreportunknowncallerIDsfrom the learned blacklist. which the user received a phone call (including both unanswered and accepted calls). Consequently, the server will receive daily reports 2 PROBLEM DEFINITION AND APPROACH from each user, which consist of the list of unknown caller IDs ob- served by the client apps running on each device. As we will explain In this section, we outline our threat model and briefly describe in Section 5, all the caller IDs are delivered by the client apps to the our approach towards collaboratively building phone blacklists in server via a novel LDP mechanism. This is done to provide privacy a privacy-preserving way. guarantees and minimize the risk of the server learning any sensitive Threat Model In designing our phone blacklisting system (see Fig- information about single users’ phone calls (e.g., whether the user ure 1), we make the following assumptions: 2 • We consider the caller ID related to phone calls received by to as SH (short for succinct histogram). Unfortunately, we have found users as privacy sensitive (e.g., see the cancer clinic example that the SH protocol is not suitable as is for providing a solution to given in Section 1). However, we do not consider the caller ID our application scenario (explained in details in Section 4). Among area code prefix (e.g., the first three digit of a US phone num- the main issues we found is the fact that SH tends to work well only ber) as sensitive. The reason is that each area code includes in expectation. As we aim to build a practical blacklisting system, we millions of possible phone numbers (e.g., 107 numbers in the would like our system to perform well for realistic, limited population US). 
Therefore, even if the attacker learns that a given user sizes (e.g., thousands of users). Furthermore, the protocol used in [3] received a phone call from a given area code prefix, she would for calculating the frequency of occurrence for a heavy hitter (i.e., be faced with very high uncertainty regarding what specific the number of calls made by a likely spammer, in our case) is complex number actually called the user. and difficult to implement efficiently (to the best of our knowledge, • We assume the privacy-preserving data collection app run- no implementation of the full [3] protocol is publicly available). ning on each user’s device is trusted. Namely, we assume the To address the above limitations of the SH protocol, we introduce app correctly implements our proposed LDP protocol (de- three LDP protocol modifications: tailed in Algorithm 2), and that it does not directly collect and report any other user data to the server other than the unknown phone numbers from which calls were received. (1) We propose a novel response randomizer that has the effect • We also assume that the server correctly executes the server- of reducing the variance in the noisy inputs received by the side of our LDP protocol, to learn a useful phone blacklist that server-side of the SH protocol, thus increasing server-side can be propagated back to the users to help them block future heavy hitter reconstruction accuracy even in the case of a spam calls. At the same time, we assume that the server may limited user population (Section 4). at some point be compromised (or subpoenaed), allowing an (2) Second, we replace the frequency oracle part of the SH pro- adversary to access future users’ reports. Unlike traditional tocol proposed in [3] with a much simpler protocol recently curator-based differential privacy mechanisms, our use of proposed in [28], whose implementation is publicly available LDP mechanisms guarantees that, in the event of a breach of (Section 3.4). the server, the privacy of users’ phone call records can still (3) To increase the relative frequency of heavy hitter caller IDs be preserved (see Section 3, for details). and boost the likelihood that the server will be able to cor- rectly reconstruct them and add them to the blacklist, we It is worth noting that the server may be able to observe the IP introduce a bucketization mechanism. In essence, before a address of each reporting device. Furthermore, in a practical de- user (more precisely, the app running on the user’s phone) ployment, the server may realistically implement an authentication reports one or more caller IDs to the server, the caller IDs are mechanism that requires users to register to the blacklisting service first grouped according to their three-digit area code. Then, (e.g., by providing an address, , etc.), to be allowed the client-side portion of the SH protocol is run independently to (privately) report call records and receive blacklist updates. In per each single group (i.e., per each area code). The intuition this case, the identity of the users may be known to the server, and a here is that some spammers tend to use phone numbers from server breach may expose such identities. However, in this paper we specific area codes. For instance, IRS phone scams are often focus exclusively on protecting the privacy of users’ phone call records, performed using caller IDs with a 202 prefix (Washington DC rather than anonymity. 
Protecting the IP address and identity of users area code), as this may trick more users into believing it is may be achieved via other security mechanisms that are outside the truly the IRS that is calling. By grouping caller IDs based on scope of this work. area code, spam numbers also tend to group, increasing their Approach Overview According to recent work on phone blacklist- relative frequency compared to all other caller IDs with the ing [22, 23], it is clear that most spammers will tend to call a large same prefix. This effect is discussed in details in Section C. number of users, in an attempt to identify a subset of them who may fall for a scam. Therefore, given a large and distributed user population, it is reasonable to consider heavy hitters as candidate Section 4 presents the details of our LDP protocol. spammers. In other words, a caller ID that is reported as unknown Caller ID Spoofing Caller ID spoofing is the main limiting factor by a significant fraction of participating smartphones satisfies the for the effectiveness of phone blacklists in general, as also acknowl- volume and diversity features used in previous work [22, 23], and edged in previous work [21, 23]. Previous research on phone black- can be considered for blacklisting. listing [21, 23] regards the prevention of caller ID spoofing as an Following the high-level approach proposed in previous work, we orthogonal research direction, leaving it to future work. This choice therefore cast the problem of learning a phone blacklist as a heavy can be justified by noting that the FCC has mandated that allUS hitter detection problem. The main research question we investigate phone companies must implement caller ID authentication by June in this paper is the following: using the system depicted in Figure 1, is 30, 2021 [11]. In response, telephone carriers have started activating it possible to accurately detect heavy hitter caller IDs while providing an authentication protocol known as SHAKEN/STIR [15]. In our local differential privacy guarantees? work, we make similar considerations as in previous work, and focus To investigate the above research question, we start from a state- ourattentiononthefeasibilityofbuildingphoneblacklistsusinguser- of-the-art LDP protocol for heavy hitter detection proposed by Bass- provided data with strong privacy guarantees. We therefore consider ily and Smith [3], which throughout the rest of the paper we will refer dealing with caller ID spoofing to be outside the scope of this paper. 3 3 BACKGROUND Algorithm 1: Rbas(x,ε): ε-Basic Randomizer 3.1 Notation Input: m-bit string x, privacy budget ε 1 Sample r ←[m] uniformly at random. Suppose there aren users, and that each user j holds an itemvj drawn 2 if x,0 then ( eε c ·m ·xr w.p. ε eε +1 from a domain V of size d (in our case, V is the set of valid phone 3 zr = e +1 , where c = ε . −c ·m ·x w.p. 1 e −1 numbers). For each item v ∈ V, its frequency f (v) is defined as the r eε +1 4 else √ √ fraction of users who hold v, i.e., f (v)= |{j ∈ [n]:vj =v}|/n, where 5 Choose zr uniformly from {c m,−c m } [n] denotes the set {1,2,...,n}. For notational simplicity, we omit the 6 return z=(0, ...,0,zr ,0, ...,0) subscript j when it’s clear from the context. A frequency oracle (FO) is a function that can (privately) estimate the frequency of any item v ∈ V among the user population. The encoded item xj and its decoding are respectively given by For a vector x=(x1,...,xm), we will use the array index notation th xj =Enc(vj )=c(vj ) and Dec(xj )=vj . 
x[i] to denote the i entry, i.e., x[i]=xi . Similarly, X[i,j] denotes the entry at location (i,j) for a matrix X. For privacy, each user j perturbs xj into a noisy report zj =Rbas(xj ,ε) using a randomizer Rbas and sends it to the server. The pseudo-code 3.2 Local Differential Privacy of randomizer Rbas is described in Algorithm 1. Differential privacy can be applied to two different settings: cen- To simplify the heavy hitter detection problem, Bassily and Smith tralized and local. In the centralized setting, it is assumed that there applied the idea of isolating heavy hitters into different channels V →[ ] exists a trusted data curator who collects personal data v=(v ,...,v ) using a pairwise independent hash function H : K , whereby 1 n ( ) from users, analyzes it, and releases the results after applying a dif- an item v is mapped to channel H v . ferentially private transformation. On the other hand, in the local This has the effect that, with high probability, no two unique setting there is no single trusted third party. To protect privacy, each heavy hitter items are mapped to the same channel (when K is suf- ( ) ∗ user independently perturbs her record v into v˜ = A(v ) using a ficiently large). For each channel , users with H vj = v encode j j j ( ) randomized algorithm A, and only shares the perturbed version vj into xj =Enc vj and send the perturbed version of xj ; whereas ( ) ∗ R ( ) with an aggregator (the centralized server responsible for blacklist xj =0 for users with H vj ,v and bas 0 is reported to the server. { ,..., } learning, in our application). The local differential privacy (LDP) Given a set of noisy reports z1 zn collected from n users, the model provides stronger privacy protection than the centralized server aggregates them to z¯ (line 6 in Algorithm 7, in Appendix), model, because it protects privacy even when the aggregator (i.e., rounds it to the nearest valid encoding y (line 7-8 in Algorithm 7, the blacklist learning server, in our case) is compromised and con- in Appendix), and finally reconstructs the heavy hitter item by de- ( ) trolled by an adversary. The level of privacy protection depends on coding it into vˆ =Dec y . To estimate the frequency of vˆ, the server { ,..., } a privacy budget parameter ε, as formally defined in [6]; the smaller collects another set of noisy reports w1 wn and estimates the ε, the greater the privacy guarantees. frequency as follows: n n 1 Õ 1 Õ fˆ(vˆ)= ⟨ w ,c(vˆ)⟩ = w [r ]·q[r ], 3.3 The Succinct Histogram Protocol n j n j j j j=1 j=1 Bassily and Smith [3] proposed an ε-LDP protocol, called Succinct Histogram (SH), for detecting heavy hitters over a large domain V. where q=c(vˆ). In their work, the authors assume that each user has a single item To filter out possible false positives, similarly to the previous to share with the server. phase the server collects noisy reports wj from users and aggregates Unfortunately, in [3] the client- and server-side of the protocol are them in a single bitstring w. For each reconstructed value vˆ in Γ, its presented as “interleaved” in a single algorithm, and to the best of our frequency f (vˆ) is estimated using a frequency oracle (FO) function. ˆ knowledge a practical implementation of the client-server protocol If the computed estimate f (vˆ) is less than a threshold η, then vˆ is was not provided. To make the LDP protocol in [3] practical and removed from Γ. 
After this filtering phase, the server can then return applicable to our collaborative blacklist learning system, we provide the set of detected heavy hitters. a new but equivalent representation of the protocol proposed in [3] The threshold η plays a crucial role in the heavy hitter detection: that focuses on the interactions between clients (i.e., the system r 2T +1 log(d)log(1/β) contributors) and server. Due to space limitations, we report our new η = (1) client-server formulation in Appendix A (see Algorithms 6 and 7). ε n The SH protocol works as follows. First, each user j ∈ [n] en- where β [3] is a parameter related to the confidence the server has codes her item vj ∈ V into a bit string of length m using a binary on the heavy hitters it has detected. The same parameter β also error-correcting code (Enc,Dec)2. For notational simplicity, we let influences the number of protocol rounds,T [3]. √ √ m Enc(·)=c(·). Let xj ∈ {−1/ m,1/ m} be the encoded binary string. We now analyze the properties of the basic randomizer. It is easy to see that for every encoded item x ∈ { √1 ,− √1 }m ∪{0} its noisy m m report w=(w1,...,wm) is an unbiased estimator of x. For users with 2 m Specifically, the protocol requires a [2 ,k,d]2 binary error-correcting code, where x,0 and an integer r ∈ [m], 2m , k, and d represent the codeword length, encoded message length (in bits), and ε minimum distance, respectively, in which the relative distance d/2m has to be included  e 1  E[wr ]=cm − xr =m·xr and in the interval (0,1/2). eε +1 eε +1 4 1 E[w]= (E[w ],...,E[w ])⊺ =x. 4.1 Overview of LDP Protocol m 1 m Following the intuitions and motivations for our approach provided For users with x=0, E[wr ]=0 for r ∈ [m], and hence E[w]=0=x. in Section 2, we cast the problem of learning a phone blacklist as a ∀ It is also easy to see that for v ∈ V the estimated frequency fˆ(v) has heavy hitter detection problem. To this end, we build our solution the following properties (see Appendix B for details): upon the SH protocol for heavy hitter detection proposed in [3] and summarized in Section 3. Unfortunately, the original SH protocol is E[ ˆ( )]= ( ) f v f v not directly suitable for our application, because it was formulated and in a theoretical “asymptotic” setting in which a large number of reporting clients is assumed [3].  ε 2  1 e +1 First, we implemented a practical client-server formulation of the Var(fˆ(v))= − f (v) . (2) n eε −1 SH protocol proposed in [3]. After performing pilot experiments, we found that applying the protocol as is in a setting with a limited Limitations: The SH protocol described in [3] was presented in a number of clients (e.g., in the tens of thousand) and in which we aim purely formal way, without addressing limitations that exist in practi- to correctly reconstruct heavy hitters with a relatively low volume cal systems. For instance, the original SH protocol was formulated in (e.g., only a few hundreds of hits) was not possible without setting an an “asymptotic” setting, in which a large number of reporting clients extremely high privacy budget, thus completely jeopardizing users’ is assumed. While the protocol works well in expectation, it presents privacy. a number of practical drawbacks, which we discuss in Section 4. To make SH usable in practice and adapt it to our phone blacklist- ing problem, we therefore designed and implemented a number of 3.4 Frequency Oracle Protocol modifications that address the following two fundamental problems: Wang et al. 
[28] recently proposed the Optimal Local Hashing (OLH) • protocol for estimating the frequency of items belonging to a given Sparsity of user reports: In the SH protocol, the larger the items V domain. It satisfies ε-LDP and is simpler and logically equivalent domain , the more frequent an item must be to be correctly to the frequency oracle proposed in [3]. Instead of transmitting a reconstructed by the server C with high probability. Namely, single bit, that is the result of mapping an item i to a binary value an item must be reported by a larger and larger population, √ √ V {c m,−c m}, the n users who participate in the system simply hash as the cardinality of increases, thus potentially impeding their items into a value in [д], where д ≥ 2. The pseudo-code of the the reconstruction of spam phone numbers involved in cam- OLH randomizer is reported in Algorithm 5 (in Appendix A). For paigns that reach only a portion of the contributing users. • further details, we refer the reader to [28] and to its publicly avail- High variance in the sum of item reports hinders noise cancella- able implementation [27]. It is worth noting that OLH is limited to tion: The sum z in Algorithm 7 (line 6) is affected by the high frequency estimation, and that it is not suitable by itself for heavy variance of the distribution of the sum of each bit. Ideally, hitter detection in large domains, as for the case in which the domain the noisy random bits sent by users who do not hold value includes all possible valid phone numbers. v (i.e.,that transmit a randomized version of x = 0 in Algo- rithm 6, lines 7-8) should cancel out during summation. While 4 SYSTEM DETAILS this holds in expectation, in practice (with a finite number of participants) it is highly unlikely to have the very same As mentioned earlier, we envision a collaborative blacklisting system number of clients who transmit √1 as clients who transmit consisting of n distinct smartphones and a centralized server C, as m − √1 , potentially causing the reconstruction of the wrong bit shown in Figure 1.C is responsible for receiving data from the partic- m ipating phones and for computing a blacklist of phone numbers (i.e., value at the server side. caller IDs) likely associated with phone spamming activities. Once computed, the blacklist can be propagated back to the participating We address the first issue by introducing a data bucketization phones to enable flagging future unwanted calls as likely spam. mechanism. Specifically, we take advantage of characteristics of Each participating smartphone runs an application that collects the phone blacklisting problem to (1) reduce the dimensionality of information about phone calls received from unknown phone num- the items domain, and (2) partition the problem domain to increase bers, where unknown here means that the caller’s phone number was the relative frequency of heavy hitters. To achieve (1), we divide not registered into the smartphone’s contact list. More precisely, let phone numbers into area code prefix (e.g., the first three digits ina pi be a participating smartphone and cj be a caller ID. If pi receives US telephone number) and phone number suffix (e.g., the remaining a call from cj and cj is not in pi ’s contact list, then cj is labeled as seven digits, for a US phone number). unknown and reported to C by pi . Notice that, in this scenario, only In telephony, area codes are typically not considered to be sensi- the caller ID cj is reported, and no information about the content of tive. 
For instance, the FTC dataset protects the privacy of complain- the call is shared with C. ing users by publishing only their area code [14] (see also Sections 2 To preserve the privacy of phone calls received by participating and 5.1). As outlined in Section 2, we aim to protect the privacy of users (i.e., the owners of the phones that contribute data to C), the the phone number suffix. Hence, given the large number of phone caller ID data collection app running on each smartphone imple- numbers that share the same prefix, clients can transmit the area ments a local differentially private (LDP) algorithm, whose details code as is, and apply the (modified) SH protocol only to the phone are described below in this section. number suffix, thus reducing the number of bits needed to represent 5 Algorithm 2: Modified SH-Client(v, H,T , K, εHH, εOLH) Algorithm 3: Modified SH-Server(T , K, P, OLH) Input: the value to be sent v, a fixed list of hash Input: # of repetitionT , # of channels K, set of prefixes P, a frequency oracle OLH, threshold η functions H, # of repetitions T , # of channels K, privacy parameters εHH and εOLH Output: list of heavy hitters Γ /* sending noisy reports for heavy hitter detection */ /* detecting heavy hitters */ 1 ← ∅ 1 ρ ←pref ix(v) Γ 2 for to do 2 σ ←v \ρ t =1 T 3 foreach prefix ρ ∈ [P] do 3 for t =1 to T do 4 foreach channel k ∈ [K] do 4 H ← H[t] 5 foreach user j ∈ [nρ ] do 5 foreach channel k ∈ [K] do ( , , ) 6 z ←z t k ρ 6 if H (σ )=k then j 7 x=Enc(σ ) value received from user j on channel k, having prefix ρ; nρ 7 z= 1 Í z 8 else nρ j=1 j 9 x=0 8 for i =1 to m do ( , , )  εHH  1 10 z t k ρ ← R x, ( √ if z[i] ≥ 0 ext m 2T 9 y[i]← (t,k, ρ) − √1 otherwise. 11 Send z to the server on channel k m /* sending noisy report for heavy hitter frequency estimation */ 10 σˆ ←Dec(y) (ρ) 12 w ← ROLH(v,εOLH) 11 vˆ ← append σ to ρ ( ) 13 Send w ρ to the server 12 if vˆ < Γ then add vˆ to Γ /* filtering out false positives */ 13 foreach prefix ρ ∈ [P] do 14 foreach user j ∈ [nρ ] do ( ) 15 w[j]←w ρ value received from user j having prefix ρ each phone number. To achieve (2), we assign a separate communi- 16 foreach vˆ ∈ Γ do cation channel between clients and server to each area code, and run 17 fˆ(vˆ)← estimate the frequency of vˆ using OLH(w) an instance of the (modified) SH protocol independently for each 18 if fˆ(vˆ) < η then remove vˆ from Γ area code. This has the effect of clustering phone numbers based 19 return {(v,fˆ(v)) : v ∈ Γ} on their prefix. Because in some cases phone spam campaigns are conducted using specific area codes (e.g., a Washington DC area code for IRS spam campaigns, or an 800 prefix for tech support scams, etc.), Notice also that while in the following we present our LDP pro- this bucketization of phone numbers has the effect of amplifying the tocol under the assumption that each user has a single item to share relative frequency of spam-related caller IDs in some of the area code with the server (e.g., one unknown phone number report per day), in buckets (or clusters), thus making it easier to detect heavy hitters. real scenarios some users may either have multiple items to share or Figure 2 shows a example of how in practice bucketization helps to nothing to share at all (e.g., no phone calls received in a given day). amplify the relative frequency of heavy hitters. 
In this case, the protocol can be easily extended as proposed in [24], In summary, we (i) group phone numbers by area code, (ii) split by sampling a single telephone number from the set of unknown area code and phone suffix, (iii) select the client-server communica- calls that the app has collected. Conversely, if a user has nothing to tion channel based on the area code, and (iv) let each client transmit share, the app can generate a dummy (but legitimate) phone number the area code in clear (i.e., no LDP) and run the SH protocol over to be sent to the server. phone number suffixes within transmitted area codes. Itisalsoimportanttonoticethatorganizingphonenumberreports To address the high variance in the sum of item reports, we intro- in buckets allows the server to count the number of users that will duce a new extended randomizer to replace the original randomizer participate to a protocol run, per each bucket. In Appendix C, we dis- proposed in [3] and reported in Algorithm 1. The main idea is to use cuss how the server can use this number to estimate the probability a three-value randomizer. For instance, when x=0 must be sent, in- that at least one heavy hitter in a specific bucket will be successfully stead of choosing a random bit value between { √1 ,− √1 }, the client m m reconstructed. If such probability is low, the server can avoid exe- app will choose between three values: { √1 ,0,− √1 }, with different cuting the protocol for those buckets and inform the clients of this m m probabilities. Our externded randomizer is defined in Algorithm 4. decision, thus preventing those clients from wasting their privacy In Appendix D we formally show how this extended randomizer budget. In practice, once the server receives the area codes from each helps reducing the variance, thus increasing the accuracy with which client, it could send a message back to the clients letting them know privately-reported phone numbers are reconstructed on the server if they should send (using the LDP protocol) the remaining portion of side. the caller IDs they observed (i.e., the remaining seven digits) or not.
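To make the client-side preprocessing described above more concrete, the following Python sketch illustrates one way the data collection app could prepare a daily report: it keeps only caller IDs not in the contact list, samples a single number to report (or generates a dummy but valid-looking number, as discussed above), and splits it into the clear-text area code and the 7-digit suffix that is protected by the LDP randomizer. The function names (e.g., prepare_daily_report) and data structures are illustrative assumptions, not part of the paper's implementation.

import random

def split_caller_id(caller_id):
    """Split a valid 10-digit US number into (area code, 7-digit suffix)."""
    assert len(caller_id) == 10 and caller_id.isdigit()
    return caller_id[:3], caller_id[3:]

def prepare_daily_report(unknown_calls, contact_list):
    """Illustrative client-side preprocessing: pick one unknown caller ID to
    report today, or a dummy number if no unknown call was received."""
    candidates = [c for c in unknown_calls if c not in contact_list]
    if candidates:
        # As proposed in [24], sample a single item from the set of unknown calls.
        reported = random.choice(candidates)
    else:
        # Nothing to report: generate a dummy (but valid-looking) 10-digit number
        # so that every user still submits a report when the protocol runs.
        reported = "".join(random.choice("0123456789") for _ in range(10))
    area_code, suffix = split_caller_id(reported)
    # The area code selects the bucket/channel and is sent in the clear;
    # only the 7-digit suffix goes through the LDP randomizer.
    return area_code, suffix

# A 7-digit suffix has 10**7 possible values, so 24 bits are enough to represent
# it before error-correction encoding (the RM(3,5) code used in the paper has k = 26 >= 24).
assert (10**7).bit_length() == 24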

Algorithm 4: Rext(x, ε): ε-extended Randomizer
  Input: m-bit string x, privacy budget ε
  1  Sample r ← [m] uniformly at random.
  2  if x ≠ 0 then
  3      zr = c·m·xr with probability p; zr = −c·m·xr with probability q; zr = 0 with probability 1 − p − q (where c > 0)
  4  else
  5      zr = c·√m with probability θ; zr = −c·√m with probability θ; zr = 0 with probability 1 − 2θ
  6  return z = (0, ..., 0, zr, 0, ..., 0)

[Figure 2: Phone number frequency before (left) and after (right) bucketization, computed on one day of complaints from the FTC dataset.]
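For illustration, the sketch below implements the basic randomizer (Algorithm 1) and the extended randomizer (Algorithm 4) side by side, together with a simple server-side averaging of the noisy reports. The probabilities p, q, and θ of the extended randomizer are left as parameters because their exact settings are derived in Appendix D; the aggregate function only illustrates that the reports are averaged per coordinate before sign rounding, and is not the full reconstruction logic of Algorithm 3. This is a minimal sketch, not the paper's implementation.

import math
import random

def r_basic(x, eps):
    """Basic randomizer (Algorithm 1): x is a length-m vector with entries in
    {+1/sqrt(m), -1/sqrt(m)}, or the all-zero vector for users with no item."""
    m = len(x)
    c = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    r = random.randrange(m)                     # one coordinate chosen at random
    z = [0.0] * m
    if any(x):
        keep = math.exp(eps) / (math.exp(eps) + 1)
        z[r] = c * m * x[r] if random.random() < keep else -c * m * x[r]
    else:
        z[r] = random.choice([c * math.sqrt(m), -c * math.sqrt(m)])
    return z

def r_extended(x, p, q, theta, c):
    """Extended randomizer (Algorithm 4). The probabilities p, q, theta and the
    constant c > 0 must be chosen so the mechanism stays eps-LDP and unbiased
    (their derivation is in Appendix D); they are inputs here, not hard-coded."""
    m = len(x)
    r = random.randrange(m)
    z = [0.0] * m
    u = random.random()
    if any(x):
        z[r] = c * m * x[r] if u < p else (-c * m * x[r] if u < p + q else 0.0)
    else:
        z[r] = c * math.sqrt(m) if u < theta else (-c * math.sqrt(m) if u < 2 * theta else 0.0)
    return z

def aggregate(reports):
    """Server side (sketch): average the noisy reports per coordinate. In
    expectation this recovers the encoded bit string, whose sign pattern is then
    rounded to +/- 1/sqrt(m) and decoded (Algorithms 3 and 7)."""
    m = len(reports[0])
    return [sum(z[i] for z in reports) / len(reports) for i in range(m)]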

6 The new LDP protocol resulting from our improvements over complaint is reported by a different user. Based on this, we determine the original SH protocol is shown in Algorithms 2 and 3, where the pool of users that would participate in our collaborative black- new pseudo-code is highlighted in black, and code that remains the listing as follows: we compute the maximum number of reports seen same as in the original SH protocol is shaded in gray. Unlike the in one day, throughout the entire month of FTC data, which is equal original version, we explicitly allocate two different privacy budgets, to 23,188; we then set the number of participants to that number. In εHH and εOLH, assigned respectively to heavy hitter detection and days in which fewer than 23,188 users sent complaints to the FTC, frequency estimation. It is worth mentioning that εHH is the total we assume that the remaining users did not have any calls to report privacy budget spent by each user to send noisy reports to the server (i.e., they did not receive any calls from unknown numbers on those during the T protocol rounds (lines 3 − 11). In this new formula- days). However, to preserve differential privacy guarantees, all users tion ε =(εHH +εOLH) and the protocol is (εHH +εOLH)-differentially must send a report every time the protocol runs (e.g., daily, in our private, as proved in Appendix D. evaluation). Therefore, if a user has no calls to report, the app on her smartphone will generate a random (but valid) 10-digit number, and 5 LDP PROTOCOL EVALUATION report that number to the server using the LDP protocol, exactly as if she received a call from that number. In this section, we present an evaluation of our LDP protocol. It is important to notice that we focus primarily on estimating the accu- racy of our system with respect to reconstructing and detecting heavy 5.2 LDP Protocol Configuration hitters. We will discuss the utility of the phone blacklist that may be In all our experiments we use area code bucketization, with the same learned from the detected heavy hitters separately, in Section 6. parameter settings for all buckets. We also compare results obtained using the basic randomizer proposed in [3] to results obtained using 5.1 Dataset our extended randomizer (Algorithm 4), using the same parameter Ideally, to evaluate our protocol we would need to collect a dataset of settings for each, to allow for an “apples to apples” comparison. phone call records from thousands of users. Even though we assume We set two different privacy budgets to run the heavy hitter detec- only calls from unknown numbers (i.e., numbers not stored in the tion and the frequency estimation protocol. Respectively, we experi- users’ respective contact lists) would be of interest for detecting po- ment with budgetεHH ∈ {12,8.8,7,5.6,4.4} for heavy hitter detection, tentialspamphonenumbers,collectingsuchadatasetisverydifficult, and εOLH =3 for frequency estimation. We chose the values εHH for due exactly to the same privacy issue we aim to solve in this paper. both the basic and extended randomizer so that the randomizer sends As a proxy for that dataset, we make use of real-world phone data ex- the correct phone number with probability approximately equal to tracted from user complaints to the FTC. In essence, the FTC allows 0.95, 0.90, 0.85, 0.80 and 0.75. Moreover, we do not allocate a εHH users in the US to report unwanted phone calls. 
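As a quick sanity check on the budget values listed above: in the basic randomizer, a single round keeps the sign of the selected bit with probability e^ε'/(e^ε' + 1), where ε' = εHH/(2T) is the per-round budget passed to the randomizer in Algorithm 2. The short computation below shows that, for T = 2, the chosen εHH values correspond to per-round keep probabilities of roughly 0.95, 0.90, 0.85, 0.80, and 0.75, which matches the probabilities quoted above. This is only a back-of-the-envelope check, not evaluation code from the paper.

import math

T = 2
for eps_hh in (12.0, 8.8, 7.0, 5.6, 4.4):
    eps_round = eps_hh / (2 * T)            # per-round budget, as used in Algorithm 2
    p_keep = math.exp(eps_round) / (math.exp(eps_round) + 1)
    print(f"eps_HH = {eps_hh:>4}: per-round keep probability = {p_keep:.3f}")
# Prints roughly 0.953, 0.900, 0.852, 0.802, 0.750, in line with the
# 0.95 / 0.90 / 0.85 / 0.80 / 0.75 values quoted above.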
Reported complaints value lower than 4.4 given that, as later discussed in Section 7, the util- that are made available to the public typically include the time of the ity of the learned blacklist is already 0 when εHH =4.4 (see Figure 7). complaint, the full caller ID, the user’s phone number area code, and a It is important to also notice that guaranteeing differential privacy label indicating the type of phone spam activity. Notice that the FTC under continual observation [8] in the more difficult LDP setting is aim to protect the privacy of the users who report a complaint; that’s still an open problem in differential privacy research. A complete why they only publish users’ phone number prefixes. On the other solution to such a challenging open problem is therefore left to future hand, the caller IDs are reported in clear because the complaining work. Nonetheless, one possible mitigation may be to design the users explicitly label them as unwanted caller, essentially consenting data collection app so that a user does not report the same number to their public release. On the other hand, our system does not require more than once (see Section 7 for further discussion). users to explicitly label unwanted phone calls; all unknown caller IDs To implement the binary encoding of phone numbers (see Algo- arereported, andprivacyispreserved viaLDPmechanisms(please re- rithm 2 line 7), we use a Reed Muller error-correcting code, RM(3,5); fer to Section 1, where we motivate the advantages of this approach). this is a [32,26,4]2 code with relative distance equal to 1/8 and error In this paper, we treat each complaint as if a user participating correcting capability equal to 1 bit [17]. Notice thatk =26 bits is suffi- in our collaborative blacklist learning system had received a phone cient to encode 7-digit phone numbers, which requires a minimum of call from the complained-about number at the time recorded in the 24 bits. At server-side, we run the heavy hitter detection phase only complaint. We were able to obtain a large set of user complaints for those buckets containing more than a minimum number, τ , of collected by the FTC between Feb. 17th 2016 and Mar. 17th, 2016, complaints, as motivated in Sections 4.1 and C. We experiment with for a total of 29 days, which consists of 471,460 complaints. For the 5 different values of τ in the set {143,151,161,174,195}. These values sake of this evaluation, we consider only valid 10-digit caller IDs, correspond, respectively, to a 75%, 80%, 85%, 90%, 95% probability of which constitute about 95% of the entire dataset. The distribution correctly reconstructing, at least, a phone number per bucket (see of complaints is characterized by two properties: (a) the volume of also Equation 4). complaints follows a weekly pattern, with fewer complaints submit- ted on weekends; and (2) the distribution of complaints per caller ID has a long tail, whereby most phone numbers receive only one 5.3 Measuring Heavy Hitter Detections complaint but there also exist many phone numbers that receive We now define how we measure accuracy for the LDP protocol. No- hundreds of complaints (see Figures 9 and 10 in appendix for details). tice that in this section we consider accuracy strictly for the heavy As the FTC dataset does not contain an identifier for the reporting hitter detection task accomplished by the LDP protocol. 
This is related to but different from the utility of the blacklist that can be

[Figure 3: True, false and undetected heavy hitters (respectively, THHs, FHHs and UHHs). Parameters: T = 2, εOLH = 3, and τ = 143.]


[Figure 4: True, false and undetected heavy hitters (respectively, THHs, FHHs and UHHs). Parameters: T = 2, εHH = 8.8, εOLH = 3.]

[Figure 5: F1-score with parameters T = 2, εOLH = 3, and τ = 143.]

learned over the phone numbers reconstructed by the server-side LDP protocol (see Section 6 for results on blacklist utility).

Let v be a phone number, and c(v) be the number of users who reported a call from v. We say that c(v) is the ground truth frequency of v. Moreover, let fˆ(v) be the number of reports about v estimated by the server after running the LDP protocol, and η be the detection threshold for heavy hitter detection defined in Equation 3 (see also Algorithms 7 and 3). Also, as discussed in Section 5.2, let τ be the minimum number of complaints necessary for the server to decide whether to run the heavy hitter detection phase for a bucket.

In theory, we could simply use η as the heavy hitter frequency threshold to measure true and false detections. However, given Equation 3 and substituting practical values of ε and β, η tends to be much smaller than τ. For instance, considering ε = 15, β = 0.751, T = 2, and d = 10^7, and assuming n = 1000 users reporting caller IDs to a given area code bucket, we obtain η ≈ 0.023. Thus, the heavy hitter detection threshold (in terms of number of reports per caller ID) would be η·n = 23. In other words, a phone number would be considered a heavy hitter if it is reported more than 22 times. Yet, as we discussed in Section C, the minimum number of reports needed for the server to correctly reconstruct a phone number with high probability is much higher than 23 (e.g., at least 84 reports are needed to have a 50% chance of correct reconstruction). We therefore use τ as the heavy hitter detection threshold, rather than relying on η. Specifically, we define the following quantities.
• True Heavy Hitters (THHs). We have a true heavy hitter detection for v if both c(v) > τ and fˆ(v) > τ.
• False Heavy Hitters (FHHs). We have a false heavy hitter detection for v if c(v) ≤ τ whereas fˆ(v) > τ.
• Undetected Heavy Hitters (UHHs). We have an undetected heavy hitter if c(v) > τ whereas fˆ(v) ≤ τ.
It is worth noting that FHHs are typically due to a phone number

Given the above definitions, and their analogy with true and false positives in detection systems, we measure the F1-score of the heavy hitter detection protocol as follows (see also the sketch below):
• Recall: R = THHs/(THHs + UHHs)
• Precision: P = THHs/(THHs + FHHs)
• F1-score: F1 = 2·(P·R)/(P + R)

5.4 LDP Heavy Hitter Detection Accuracy

Figure 3 shows the number of THHs detected for different privacy budgets εHH, when T = 2 and εOLH = 3 (error bars represent one standard deviation). The figure compares the accuracy that can be obtained by using the basic randomizer (as in [3]) and our extended randomizer (Algorithm 4). The maximum privacy budget spent daily by each client running the LDP protocol can be computed by summing the privacy budgets εHH and εOLH (e.g., ε = 15, when εHH = 12 and εOLH = 3). It is worth noting that a user may have no phone number to report: in that case, the privacy budget spent by the client would be 0. Each experimental evaluation with a given εHH and εOLH was repeated 10 times, and the results averaged.

Figure 3 also reports the number of FHHs and UHHs obtained for different values of the privacy budget.
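The sketch below turns the definitions above into code: it evaluates the threshold η from Equation 3 for the example values quoted in the text (ε = 15, β = 0.751, T = 2, d = 10^7, n = 1000), assuming natural logarithms (which reproduces η ≈ 0.023), and then classifies caller IDs into THHs, FHHs, and UHHs and computes precision, recall, and F1-score. The ground-truth and estimated report counts are passed in as hypothetical dictionaries; this is an illustrative sketch, not the evaluation code used for the figures.

import math

def eta_threshold(T, eps, d, beta, n):
    """Detection threshold from Equation 3 (natural logarithms assumed here)."""
    return (2 * T + 1) / eps * math.sqrt(math.log(d) * math.log(1 / beta) / n)

print(round(eta_threshold(T=2, eps=15, d=10**7, beta=0.751, n=1000), 3))  # ~0.023

def heavy_hitter_metrics(true_counts, est_counts, tau):
    """Classify caller IDs into THHs / FHHs / UHHs and compute P, R, F1.
    true_counts maps v -> c(v); est_counts maps v -> estimated report count."""
    thh = sum(1 for v, c in true_counts.items()
              if c > tau and est_counts.get(v, 0) > tau)
    fhh = sum(1 for v, f in est_counts.items()
              if f > tau and true_counts.get(v, 0) <= tau)
    uhh = sum(1 for v, c in true_counts.items()
              if c > tau and est_counts.get(v, 0) <= tau)
    recall = thh / (thh + uhh) if (thh + uhh) else 0.0
    precision = thh / (thh + fhh) if (thh + fhh) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"THHs": thh, "FHHs": fhh, "UHHs": uhh,
            "precision": precision, "recall": recall, "F1": f1}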
As can be seen, the LDP proto- v whose true frequency c(v) is just below τ , and for which the noise col detects less than 8 FHHs, on average. As mentioned in Section 5.3, introduced by the LDP protocol causes the server to (by chance) such false heavy hitters are phone numbers whose frequency is just estimate its frequency above the heavy hitter detection threshold. below τ and whose LDP-estimated frequency happens to slightly On the other hand, UHHs represent heavy hitters that the protocol exceed the heavy hitter detection threshold due to the randomiza- fails to detect, due to the random noise added by the clients. Given tion of user contributions. In addition, the higher the εHH allocated 8 100 CBR ε=15.0 CBR ε=8.6 80 CBR ε=11.8 CBR ε=7.4 CBR ε=10.0 90 60

[Figure 6: F1-score with parameters: T = 2, εHH = 8.8, εOLH = 3, and τ = 143.]

[Figure 7: CBR: percentage of calls blocked compared to the baseline.]

week going from Feb. 17th to Feb. 23rd. The CBR is then computed by for detecting heavy hitters, the higher the probability of correctly measuring how many calls are flagged by B on the day of deployment. reconstructing phone numbers whose frequency is just below τ To enable a comparison between the private and non-private and, hence, generating FHHs. Figure 4 shows how THHs, FHHs, versions of blacklist learning, we set the same fixed heavy hitter and UHHs vary with τ , using the same parameters of the previous detection threshold θ for both. In other words, in the case when no experimental evaluations, but fixing εHH to 8.8. As can be observed, privacy is offered, caller IDs that are reported by more than θ users increasing τ decreases the total number of detectable heavy hitters in a day are considered as potential spammers. Similarly, when our (i.e., the sum of THHs and UHHs), as expected, since fewer and fewer LDP protocol is used to learn the blacklist, we fix the heavy hitter reported caller IDs will have a true frequency c(v) >τ . detection threshold τ = θ. The other protocol parameters in this It is also important to notice that, as shown in Figure 5, overall our experiment are set toT =2 and εOLH =3, while varying εHH. LDP protocol with the extended randomizer performs better than As a baseline, we compute (over the FTC dataset) the median call using the basic randomizer proposed in [3], when εHH ∈ {12,8.8,7}, blocking rate CBR∗ that can be achieved throughout a month of FTC which yield an F1-score above 85%. The scores have been computed reports, without applying any privacy-preserving mechanism and by using THHs, FHHs, and UHHs depicted in Figure 3, averaged for different values of θ (we compute the median because it is less across 10 runs. For lower values of εHH, the F1-score decreases sig- sensitive to outliers, compared to the average). Then, we compare nificantly, and the extended randomizer tends to perform slightly CBR∗ to the CBR obtained by the blacklist learned using our LDP worse than the basic randomizer. This is because the probability of protocol, by computing the median of the fraction of calls that our injecting noise in the reports (at the clients side) increases consider- blacklist would block, compared to CBR∗. The results are reported in ably. This aspect, in combination with the fact that, by definition, the Figure 7. As can be seen, as the overall privacy budget ε increases, the extended randomizer has a lower probability of sending the correct CBR approaches the baseline CBR∗, which is indicated by the 100% report to the server, compared the basic randomizer, determines a mark. It can be noticed that the difference with the baseline increases slight reduction in performance for low values of εHH. as θ reduces. This is because it is more unlikely that the server will Finally, Figure 6 reports the F1-score as τ changes. Higher values correctly reconstruct caller IDs that have a lower number of reports of τ allow us to obtain higher scores, because the number of UHHs (see Section C). Therefore, as θ decreases, heavy hitters with low decreases (see Figure 4). The basic and extended randomizers follow frequency (close to θ) can still be detected in the scenario without similar trends, though the extended randomizer performs better privacy, but become more difficult to detect for our LDP protocol. than the basic one, independently from the choice of τ . 
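The blacklist update and CBR computation described in this section can be summarized as follows: a one-week sliding window of daily heavy hitters forms the blacklist deployed on a given day, the CBR is the fraction of that day's reported calls whose caller ID is blacklisted, and the utility in Figure 7 is the median CBR expressed as a percentage of the non-private baseline CBR*. The sketch below captures this logic; the per-day data structures are hypothetical stand-ins for the FTC-derived data used in the evaluation.

from statistics import median

def blacklist_for_day(daily_heavy_hitters, day_index, window_days=7):
    """Union of the heavy hitters detected during the past `window_days` days;
    older entries are forgotten (sliding window, as in [23])."""
    start = max(0, day_index - window_days)
    blacklist = set()
    for day in range(start, day_index):
        blacklist |= daily_heavy_hitters[day]
    return blacklist

def call_blocking_rate(blacklist, calls_today):
    """Fraction of the day's reported calls whose caller ID is blacklisted."""
    if not calls_today:
        return 0.0
    return sum(1 for caller_id in calls_today if caller_id in blacklist) / len(calls_today)

def median_cbr(daily_heavy_hitters, daily_calls, window_days=7):
    rates = [call_blocking_rate(blacklist_for_day(daily_heavy_hitters, d, window_days),
                                daily_calls[d])
             for d in range(window_days, len(daily_calls))]
    return median(rates)

def relative_utility(cbr_ldp, cbr_baseline):
    """Utility as in Figure 7: median CBR as a percentage of the baseline CBR*."""
    return 100.0 * cbr_ldp / cbr_baseline if cbr_baseline else 0.0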
In practice, whenever a user receives a call from an unknown caller ID that is in the blacklist, the app will inform the user that the 6 BLACKLIST UTILITY number is suspicious, and potentially involved in spamming. The In Section 5 we evaluated the ability of our LDP protocol to accu- user may ultimately decide to pick up the call, but use more caution rately detect heavy hitter caller IDs. We now look at how a blacklist when interacting with the other party. learned over heavy hitters detected using our protocol would fare compared to when no privacy is preserved, whereby caller IDs are 7 DISCUSSION collected from users’ phones and sent directly to the server (no noise In Section 5.4, we have reported several results related to the accuracy added). To compare these scenarios, we leverage the call blocking of the proposed LDP protocol using different privacy budgets and rate (CBR) metric proposed in [23]. confidence parameters. Depending on how much budget the server In a way similar to [23], we define a blacklist B as a set of caller IDs provides to system users, the SH protocol parameters can be tuned that have been reported by users more than θ times. Specifically, as to control privacy/utility trade-offs. As mentioned in Section 5.2, in [23], we use a sliding window mechanism, whereby a blacklist B Apple uses up to ε =16 [1] as privacy budget for gathering statistics. is updated daily by cumulatively adding daily heavy hitter caller IDs For instance, the Safari browser allows for two user contributions observed over the past time window (one week, in our experiments). per day, with ε =8 each. On the other hand, in this paper we exper- Blacklisted caller IDs older than the sliding window are forgotten, imented with a maximum privacy budget of ε = 15 with one user and removed from B. As an example, making use again of the FTC contribution per day. While the privacy budget may seem somewhat dataset (see Section 5.1), the blacklist that each user deploys on Feb- high compared to non-local differential privacy applications, it is ruary 24th contains all the heavy hitters detected each day during the worth noting that this is due to the inherent complexity of LDP. 9 For instance, it has been shown that ε-LDP distribution estimators and utility we presented in this paper would still be relevant even if require k/ε2 times larger datasets than a comparable non-private [3] was replaced by a different LDP heavy hitter detection protocol. algorithm [6, 19], wherek is the size of the input alphabet (i.e.,k is the In Section 5, we performed experiments with a fixed value of pa- number of possible phone number combinations, in our case). As k rameterT =2. In the original formulation of the SH protocol [3],T is can be very large, higher values of ε allow us to achieve an acceptable directly related to the parameter β we mentioned in Section 3. While utility even with relatively small values of the sample set size n (i.e., it would be possible in theory to use higher parameter values, in- the number of noisy reports received by the server). Furthermore, creasingT (by varying β) would result in a higher number of protocol we also showed that even for lower values of epsilon (e.g., ε =11.8), rounds, and would thus consume a much larger privacy budget ε for blacklist utility can still be reasonable (e.g., around 80% of the CBR∗ eachuser.Conversely,increasingT whilekeepingε fixedwouldcause obtained in the scenario with no privacy, as shown in Section 6). 
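To give an order-of-magnitude feel for the k/ε² sample-complexity penalty mentioned above, the snippet below evaluates that factor when k is taken to be the domain of 7-digit suffixes (10^7) for a few of the privacy budgets used in this paper. Interpreting k as the suffix domain size is an assumption made purely for illustration.

k = 10**7                          # 7-digit phone number suffixes per area code
for eps in (15.0, 11.8, 10.0):
    factor = k / eps**2
    print(f"eps = {eps:>4}: required dataset blow-up factor ~ {factor:,.0f}")
# For eps = 15 the factor is roughly 4.4e4, which illustrates why comparatively
# large privacy budgets (or very many users) are needed in the local model.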
a significant degradation of heavy hitter detection accuracy, andin A privacy budget that can provide more privacy while keeping a turn of the blacklist utility. Therefore, for the sake of brevity, we did good performance trade-off is ε = 10 (with T = 2, εHH = 7, εOLH = 3, not report experimental results obtained with larger values ofT . and τ =143): our experimental evaluation shows an F1-score higher than 75% with the detection of more than 97 potential spam phone numbers per day, on average. 8 RELATED WORK A limitation of our system, which is common to practical deploy- Besides RAPPOR [9, 10], which we briefly discussed in Section 7, ments of LDP such as in the case of Apple and other vendors, is that there exist other works related to LDP heavy hitter detection; we guaranteeing differential privacy under continual observation8 [ ] briefly discuss them below. However, it should be noted that ourwork in an LDP setting is still an open research problem in differential is different from the ones discussed here. Our main contributions privacy. As a possible mitigation, the data collection app running are in adapting a state-of-the-art protocol proposed in [3] to make it on the user’s phone can keep history of the reported numbers and practical, and in using the adapted protocol to build a collaborative avoid reporting the same calling number more than once within a phone blacklisting system with provable privacy guarantees. given time window (e.g., one month). This would make it much more In [24], the SH protocol proposed in [3] is extended to handle set- difficult for the server to identify a phone number that mayhave valued data, where each user holds a set of items v= {v1,...,vt } ⊆ V. called a specific user with high frequency (e.g, once a day), since One difficulty in the set-valued data setting is that the length ofthe it will be reported only once by that user. At the same time, if the itemset each user has is different. To address this challenge, Qin et same number is reported only once but by many users, it can still al [24] proposed a protocol, called LDPMiner, for finding heavy hit- be detected as heavy hitter and added to the blacklist. ters from set-valued data. The main idea of LDPMiner is to pad each It is also possible that a legitimate phone number may be reported user’s itemset with dummy items to ensure that it has the fixed length by many users, such as in the case of school alert numbers or other ℓ. Each user randomly samples one item from v and reports the item types of emergency phone numbers that may contact a large number usingtheSHprotocol.TheestimatedfrequencyofitemsinLDPMiner of users at once, since these numbers may not be recorded in every is multiplied by ℓ to account for the random sampling procedure. user’s contact list. Such phone numbers may potentially be detected Bassily et al. [2] and Wang et al. [29] independently proposed a as heavy hitters, and thus considered by the server for blacklisting. similar protocol that iteratively identifies heavy hitters using a prefix However, the server could check the validity of a number, before tree. In their protocol, users are randomly split intoд disjoint groups. propagating it to the blacklist. For instance, the server could use au- At iteration i, the server receives noisy reports from the users in tomated reverse phone number lookup services (e.g., whitepages.com) the ith group, . Each user in the ith group reports the randomized to filter out possible false positives related to emergency numbers. 
It is also possible that a legitimate phone number may be reported by many users, such as in the case of school alert numbers or other types of emergency phone numbers that may contact a large number of users at once, since these numbers may not be recorded in every user's contact list. Such phone numbers may potentially be detected as heavy hitters, and thus considered by the server for blacklisting. However, the server could check the validity of a number before propagating it to the blacklist. For instance, the server could use automated reverse phone number lookup services (e.g., whitepages.com) to filter out possible false positives related to emergency numbers.
Our work is based on the heavy hitter LDP protocol proposed in [3], which, to the best of our knowledge, was one of very few state-of-the-art LDP protocols for heavy hitter detection at the time when we started the research presented in this paper. Alongside [3], RAPPOR [9, 10] is another protocol that could be adapted to fit our problem. However, it has been shown that RAPPOR performs less well than a more recent protocol named TreeHist [2], and that in turn TreeHist itself has a higher worst-case error compared to the original SH protocol proposed in [3]. Similarly, it has been shown in [28] that for frequency estimation the OLH protocol (which we summarized in Section 3.4 and used in our system) performs better than RAPPOR.
Recently, a few new LDP protocols for heavy hitter detection have also been proposed [2, 24, 29]. However, [3] remains a state-of-the-art protocol that has inspired more recent works. Furthermore, in this paper we focus on studying how to make LDP heavy hitter detection practical to address an important and previously unsolved security problem: privacy-preserving collaborative phone blacklisting. We believe that the application-specific trade-offs between privacy and utility we presented in this paper would still be relevant even if [3] was replaced by a different LDP heavy hitter detection protocol.
In Section 5, we performed experiments with a fixed value of parameter T = 2. In the original formulation of the SH protocol [3], T is directly related to the parameter β we mentioned in Section 3. While it would be possible in theory to use higher parameter values, increasing T (by varying β) would result in a higher number of protocol rounds, and would thus consume a much larger privacy budget ε for each user. Conversely, increasing T while keeping ε fixed would cause a significant degradation of heavy hitter detection accuracy, and in turn of the blacklist utility. Therefore, for the sake of brevity, we did not report experimental results obtained with larger values of T.

8 RELATED WORK
Besides RAPPOR [9, 10], which we briefly discussed in Section 7, there exist other works related to LDP heavy hitter detection; we briefly discuss them below. However, it should be noted that our work is different from the ones discussed here. Our main contributions are in adapting a state-of-the-art protocol proposed in [3] to make it practical, and in using the adapted protocol to build a collaborative phone blacklisting system with provable privacy guarantees.
In [24], the SH protocol proposed in [3] is extended to handle set-valued data, where each user holds a set of items v = {v1, ..., vt} ⊆ V. One difficulty in the set-valued data setting is that the length of the itemset each user holds is different. To address this challenge, Qin et al. [24] proposed a protocol, called LDPMiner, for finding heavy hitters from set-valued data. The main idea of LDPMiner is to pad each user's itemset with dummy items to ensure that it has a fixed length ℓ. Each user randomly samples one item from v and reports the item using the SH protocol. The estimated frequency of items in LDPMiner is multiplied by ℓ to account for the random sampling procedure.
Bassily et al. [2] and Wang et al. [29] independently proposed a similar protocol that iteratively identifies heavy hitters using a prefix tree. In their protocol, users are randomly split into g disjoint groups. At iteration i, the server receives noisy reports from the users in the i-th group. Each user in the i-th group reports the randomized version of the first li bits of the encoded item (i.e., a prefix of length li), where l1 < l2 < ··· < lg. After aggregating the user reports from the i-th group, the server identifies frequent prefixes Ci of length li and builds the candidate heavy hitter items of length li+1 by concatenating Ci with strings in {0,1}^(li+1−li). Recently, Wang et al. [30] provided a thorough analysis of the "pad-and-sampling-based frequency oracle (PSFO)" and proposed an LDP solution to the frequent itemset mining problem. Their protocol adaptively chooses between two algorithms based on the size of the domain |V|.
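To make the prefix-tree iteration from [2, 29] concrete, the short Python sketch below shows only the candidate-expansion step described above (extending each frequent prefix of length li with every suffix in {0,1}^(li+1−li)). The function name expand_candidates is hypothetical, and the grouping and frequency-estimation steps of those protocols are intentionally omitted.

from itertools import product

def expand_candidates(frequent_prefixes, li, li_next):
    """Extend each frequent prefix of length li to all candidate strings of
    length li_next by appending every possible bit suffix."""
    suffixes = ["".join(bits) for bits in product("01", repeat=li_next - li)]
    return [prefix + suffix for prefix in frequent_prefixes for suffix in suffixes]

# Example: prefixes of length 2 found frequent, expanded to length-4 candidates
print(expand_candidates(["01", "11"], li=2, li_next=4))
# ['0100', '0101', '0110', '0111', '1100', '1101', '1110', '1111']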
9 CONCLUSION
We proposed a novel collaborative detection system that learns a list of spam-related phone numbers from call records contributed by participating users. Our system makes use of local differential privacy to provide clear privacy guarantees. We evaluated the system on real-world user-reported call records collected by the FTC, and showed that it is possible to learn a phone blacklist in a privacy-preserving way using a reasonable overall privacy budget, while at the same time maintaining the utility of the learned blacklist.

REFERENCES
[1] Apple Inc. Apple differential privacy technical overview, 2016. https://images.apple.com/privacy/docs/Differential_Privacy_Overview.pdf.
[2] R. Bassily, K. Nissim, U. Stemmer, and A. Guha Thakurta. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems 30, pages 2288–2296. Curran Associates, Inc., 2017.
[3] R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127–135. ACM, 2015.
[4] T. S. Bernard. Yes, it's bad. Robocalls, and their scams, are surging, 2018. https://www.nytimes.com/2018/05/06/your-money/robocalls-rise-illegal.html.
[5] G. Bianchi, L. Bracciale, and P. Loreti. Better than nothing privacy with Bloom filters: To what extent? In Privacy in Statistical Databases, volume 7556 of Lecture Notes in Computer Science, pages 348–363. Springer Berlin Heidelberg, 2012.
[6] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS '13, 2013.
[7] C. Dwork. Differential privacy. In ICALP, pages 1–12. Springer, 2006.
[8] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 715–724. ACM, 2010.
[9] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS '14, pages 1054–1067. ACM, 2014.
[10] G. Fanti, V. Pihur, and Ú. Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies (PoPETs), issue 3, 2016.
[11] FCC. FCC mandates that phone companies implement caller ID authentication to combat spoofed robocalls, 2020. https://docs.fcc.gov/public/attachments/DOC-363399A1.pdf.
[12] FTC. Abusive robocalls and how we can stop them, 2018. https://www.ftc.gov/system/files/documents/public_statements/1366628/p034412_commission_testimony_re_abusive_robocalls_senate_04182018.pdf.
[13] FTC. FTC and FCC to host joint policy forum and consumer expo to fight the scourge of illegal robocalls, 2018. https://www.ftc.gov/news-events/press-releases/2018/03/ftc-fcc-host-joint-policy-forum-consumer-expo-fight-scourge.
[14] FTC. Do Not Call (DNC) reported calls data, 2019. https://www.ftc.gov/site-information/open-government/data-sets/do-not-call-data.
[15] B. Fung. Report: Americans got 26.3 billion robocalls last year, up 46 percent from 2017, 2019. https://www.washingtonpost.com/technology/2019/01/29/report-americans-got-billion-robocalls-last-year-up-percent/.
[16] Google. Use caller ID and spam protection, 2019. https://support.google.com/phoneapp/answer/3459196?hl=en.
[17] V. Guruswami. List Decoding of Error-Correcting Codes: Winning Thesis of the 2002 ACM Doctoral Dissertation Competition, volume 3282. Springer Science & Business Media, 2004.
[18] IRS. Phone scams pose serious threat; remain on IRS 'dirty dozen' list of tax scams, 2018. https://www.irs.gov/newsroom/phone-scams-pose-serious-threat-remain-on-irs-dirty-dozen-list-of-tax-scams.
[19] P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estimation under local privacy. In International Conference on Machine Learning, pages 2436–2444, 2016.
[20] J. P. C. Kleijnen, A. A. N. Ridder, and R. Y. Rubinstein. Variance Reduction Techniques in Monte Carlo Methods. Springer US, 2013.
[21] H. Li, X. Xu, C. Liu, T. Ren, K. Wu, X. Cao, W. Zhang, Y. Yu, and D. Song. A machine learning approach to prevent malicious calls over telephony networks. In IEEE Symposium on Security and Privacy (SP), pages 561–577, 2018.
[22] J. Liu, B. Rahbarinia, R. Perdisci, H. Du, and L. Su. Augmenting telephone spam blacklists by mining large CDR datasets. In Proceedings of the 2018 Asia Conference on Computer and Communications Security, ASIACCS '18, 2018.
[23] S. Pandit, R. Perdisci, M. Ahamad, and P. Gupta. Towards measuring the effectiveness of telephony blacklists. In Network and Distributed System Security Symposium, NDSS, 2018.
[24] Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren. Heavy hitter estimation over set-valued data with local differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 192–203. ACM, 2016.
[25] Nomorobo. Caller ID, SMS spam blocking and dialer, 2019. https://play.google.com/store/apps/details?id=com.nomorobo&hl=en_US.
[26] TrueCaller. Caller ID, SMS spam blocking and dialer, 2019. https://play.google.com/store/apps/details?id=com.truecaller&hl=en_US.
[27] T. Wang. Sample OLH implementation in Python, 2018. https://github.com/vvv214/OLH.
[28] T. Wang, J. Blocki, N. Li, and S. Jha. Locally differentially private protocols for frequency estimation. In 26th USENIX Security Symposium (USENIX Security 17), 2017.
[29] T. Wang, N. Li, and S. Jha. Locally differentially private heavy hitter identification. arXiv preprint arXiv:1708.06674, 2017.
[30] T. Wang, N. Li, and S. Jha. Locally differentially private frequent itemset mining. In 2018 IEEE Symposium on Security and Privacy (SP), pages 578–594, 2018.
[31] YouMail. Stop robocalls forever, 2019. https://www.youmail.com.

A CLIENT-SERVER SH ALGORITHMS
Algorithms 6 and 7 show the client-server formulation of the SH protocol discussed in Section 3.3.
In order to run the client protocol, each client first needs to know the number of communication channels K that has to be established with the server for sending private reports. Hence, before starting the SH protocol, the server communicates the correct number of channels K to the clients. Notice that the server is the only one who can compute K, since K depends on the number of users contributing to the system at any given time. For T times, in each channel k and round t in [T], the user sends to the server a randomized report z^(t,k), which represents the (encoded) value of v she holds, or a special value 0 indicating that the user does not hold a value to be reported.
The choice of sending the randomized report associated with Enc(v) (or with 0) depends on whether the channel identifier k matches the value returned by the hash function H applied on v. H belongs to a pairwise independent hash function family H, publicly available and accessible to all the clients as part of the client-side protocol configuration.
In each round of the protocol, a different hash function is employed to minimize the probability of collisions among different heavy hitters. Notice that, except for a single channel in which the client sends the private report obtained from Rbas for a value v, in all the other channels the client sends randomized reports for the special value 0 (see Algorithm 6).
On the server side, the server receives in each channel k the private reports z^(t,k) sent by users for each specific run t. In each round and for each channel, the server aggregates the randomized reports to reconstruct the codeword y whose hash of the original value v corresponds to channel k. Hence, the decoded value v̂, if correctly reconstructed, should represent the private information sent by (a non-negligible number of) users in the k-th channel in a specific run of the SH protocol. The set of reconstructed values is stored in the set of potential heavy hitters Γ. Due to noisy reports, some values in Γ may not be heavy hitters.
To filter out possible false positives, similarly to the previous phase, the server collects noisy reports wj from users and aggregates them in a single bitstring w. For each reconstructed value v̂ in Γ, its frequency f(v̂) is estimated using a frequency oracle (FO) function. If the computed estimate f̂(v̂) is less than a threshold η, then v̂ is removed from Γ. After this filtering phase, the server can then return the set of detected heavy hitters.

Algorithm 5: R_OLH(v, ε): ε-OLH Randomizer
Input: value v, a hash function H, OLH parameter g, privacy budget ε
1: x ← H(v) mod g
2: Sample y ← [g] \ {x} uniformly at random
3: w ← x with probability e^ε / (e^ε + g − 1), and w ← y with probability (g − 1) / (e^ε + g − 1)
4: return w
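The following short Python sketch mirrors the ε-OLH randomizer in Algorithm 5. The shared hash function H is modeled here with a seeded SHA-256 digest purely as a stand-in, and the choice of g is an illustrative assumption.

import math, random, hashlib

def olh_randomize(v, seed, g, eps):
    """epsilon-OLH randomizer (as in Algorithm 5): hash the value into [g], keep the
    hashed bucket with prob. e^eps/(e^eps+g-1), otherwise report a random other bucket."""
    h = int(hashlib.sha256(f"{seed}|{v}".encode()).hexdigest(), 16)
    x = h % g
    keep_prob = math.exp(eps) / (math.exp(eps) + g - 1)
    if random.random() < keep_prob:
        return x
    y = random.randrange(g - 1)          # uniform over [g] \ {x}
    return y if y < x else y + 1

# Example usage; setting g close to e^eps + 1 is a common OLH choice, assumed here.
eps = 3.0
g = int(round(math.exp(eps))) + 1
print(olh_randomize("+18005551212", seed=42, g=g, eps=eps))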

The threshold η plays a crucial role in the heavy hitter detection:

\[ \eta = \frac{2T+1}{\varepsilon}\,\sqrt{\frac{\log(d)\,\log(1/\beta)}{n}} \tag{3} \]

where β [3] is a parameter related to the confidence the server has on the heavy hitters it has detected. The same parameter β also influences the number of protocol rounds, T [3]. The server-side protocol pseudo-code is represented in Algorithm 7, whereas Algorithm 5 refers to the discussion in Section 3.4.

Algorithm 6: SH-Client(v, H, T, K, ε)
Input: the m-bit string representation v of the value v to be sent, a fixed list of hash functions H, # of repetitions T, # of channels K, privacy parameter ε
/* sending noisy reports for heavy hitter detection */
1: for t = 1 to T do
2:     H ← H[t]
3:     foreach channel k ∈ [K] do
4:         if H(v) = k then
5:             x ← Enc(v)
6:         else
7:             x ← 0
8:         z^(t,k) ← Rbas(x, ε/(2T+1))
9:         Send z^(t,k) to the server on channel k
/* sending noisy report for heavy hitter frequency estimation */
10: w ← Rbas(v, ε/(2T+1))
11: Send w to the server

Algorithm 7: SH-Server(T, K, FO)
Input: # of repetitions T, # of channels K, a frequency oracle FO, a threshold η
Output: list of heavy hitters Γ
/* detecting heavy hitters */
1: Γ ← ∅
2: for t = 1 to T do
3:     foreach channel k ∈ [K] do
4:         foreach user j ∈ [n] do
5:             z_j ← z^(t,k) value received from user j on channel k
6:         z ← (1/n) Σ_{j=1}^{n} z_j
7:         for i = 1 to m do
8:             y[i] ← 1/√m if z[i] ≥ 0, and −1/√m otherwise
9:         v̂ ← Dec(y)
10:        if v̂ ∉ Γ then add v̂ to Γ
/* filtering out false positives */
11: foreach user j ∈ [n] do
12:     w_j ← w value received from user j
13: w ← (1/n) Σ_{j=1}^{n} w_j
14: foreach v̂ ∈ Γ do
15:     f̂(v̂) ← estimate the frequency of v̂ using FO(w)
16:     if f̂(v̂) < η then remove v̂ from Γ
17: return {(v, f̂(v)) : v ∈ Γ}
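As a small worked example of Equation (3) above, the sketch below computes the filtering threshold η for a hypothetical parameter setting; the specific values of d, β, T, ε, and n are illustrative assumptions only.

import math

def sh_threshold(T, eps, d, beta, n):
    """Heavy hitter filtering threshold eta from Equation (3)."""
    return ((2 * T + 1) / eps) * math.sqrt(math.log(d) * math.log(1 / beta) / n)

# Illustrative values: T = 2 rounds (as in the experiments), a 24-bit domain,
# and assumed beta, eps, n chosen purely for demonstration.
eta = sh_threshold(T=2, eps=7.0, d=2 ** 24, beta=0.05, n=100_000)
print(round(eta, 4))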

B ANALYSIS OF THE BASIC RANDOMIZER
We first show that the frequency estimate fˆ(v) obtained using the basic randomizer is unbiased:

\[
E[\hat f(v)] = E\Big[\frac{1}{n}\sum_{j=1}^{n} w_j^{\top} x_v\Big]
= \frac{1}{n}\Big\{\sum_{j: v_j = v} E[w_j^{\top} x_v] + \sum_{j: v_j \ne v} E[w_j^{\top} x_v]\Big\}
= \frac{1}{n}\sum_{j: v_j = v} x_j^{\top} x_v
= \frac{1}{n}\sum_{j: v_j = v} \lVert x_j \rVert^{2}
= \frac{\sum_{j: v_j = v} 1}{n} = f(v),
\]

where x_v = c(v) denotes the encoding of item v.
We next calculate the variance of the estimate given by the basic randomizer. Let r = (r_1, ..., r_j, ..., r_n) be the vector of random bit positions chosen by the users, where r_j ∈ [m]. By the law of total variance, for an item v ∈ V, the variance of the estimate fˆ(v) is

\[
\operatorname{Var}(\hat f(v)) = E\big[\operatorname{Var}(\hat f(v)\mid r)\big] + \operatorname{Var}\big(E[\hat f(v)\mid r]\big)
= \frac{1}{n}\big\{(c^{2}-1)\,f(v) + (1 - f(v))\,c^{2}\big\}
= \frac{c^{2} - f(v)}{n},
\]

where we have

\[
\operatorname{Var}(\hat f(v)\mid r)
= \operatorname{Var}\Big(\frac{1}{n}\sum_{j=1}^{n} w[r_j]\, x_v[r_j] \,\Big|\, r\Big)
= \frac{1}{n^{2}}\sum_{j=1}^{n} \operatorname{Var}(w[r_j]\mid r_j)\, x_v[r_j]^{2}
= \frac{x_v[r_j]^{2}}{n^{2}}\Big\{ n f(v)\big(c^{2}m^{2}x[r_j]^{2} - m^{2}x[r_j]^{2}\big) + n\,(1-f(v))\big(c^{2}m - 0^{2}\big) \Big\}
\]

and

\[
E[\hat f(v)\mid r]
= E\Big[\frac{1}{n}\sum_{j=1}^{n} w[r_j]\, x_v[r_j] \,\Big|\, r\Big]
= \frac{1}{n}\Big\{\sum_{j: v_j = v} E\big[w[r_j]\, x_v[r_j] \mid r_j\big] + 0\Big\}
= \frac{1}{n}\sum_{j: v_j = v} m\, x[r_j]^{2}.
\]
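The unbiasedness derivation above can be checked empirically. The minimal simulation below assumes the standard basic-randomizer parameters c = (e^ε + 1)/(e^ε − 1) and flip probability p = e^ε/(e^ε + 1) (consistent with the reduction noted in Appendix D), and a population in which every user holds the same item, so the true frequency is 1. All names and constants here are illustrative.

import math, random

def basic_randomizer_report(x, eps, rng):
    """One user's report: sample a coordinate r of the encoded item x and release
    c*m*x[r] with prob. p, or -c*m*x[r] otherwise (returned as (r, value))."""
    m = len(x)
    c = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    p = math.exp(eps) / (math.exp(eps) + 1)
    r = rng.randrange(m)
    sign = 1.0 if rng.random() < p else -1.0
    return r, sign * c * m * x[r]

rng = random.Random(0)
m, n, eps = 32, 50_000, 2.0
x_v = [rng.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(m)]  # encoding of item v

# every simulated user holds v, so f(v) = 1; the estimate should concentrate near 1
acc = 0.0
for _ in range(n):
    r, w_r = basic_randomizer_report(x_v, eps, rng)
    acc += w_r * x_v[r]          # sparse dot product w_j^T x_v
print(round(acc / n, 2))          # prints approximately 1.0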
C ANALYSIS OF AREA CODE BUCKETIZATION
Let us first analyze how the probability that the server correctly reconstructs a reported phone number depends on the size of the phone number space and the number of reports. In this simplified analysis, we will assume no noise is added to the data transmitted from the clients to the server. In other words, we will follow the fundamental steps of the SH protocol in Algorithm 6, but pretend that the randomizer (line 8) always returns the true value of one randomly selected bit.
Let us now consider a domain V in which each value can be represented using l bits (i.e., |V| = 2^l). Also, let us consider a value v ∈ V transmitted by n clients. For the sake of this simplified analysis, on the server side we can view v as a sequence of l different bins that are initially empty, and the bits sent by the clients as balls. According to the SH protocol for heavy hitter detection, each client transmits only one bit, and therefore the server receives n balls. To correctly reconstruct the value v, at least one ball must fill each bin. As reported in [5], the number of non-empty bins resulting from randomly inserting n balls into l bins has the following probability distribution:

\[ U_{l,n}(b) = \frac{\binom{l}{b}\,\left\{ {n \atop b} \right\}\, b!}{l^{\,n}}, \qquad \forall\, b \in \{1,\dots,l\} \tag{4} \]

where \left\{ {n \atop b} \right\} is a Stirling number of the second kind, which expresses the number of ways to partition a set of n elements into b non-empty subsets. The numerator in the equation expresses the number of ways in which n balls fall in exactly b bins out of the l available ones. Therefore, for b = l, U_{l,n}(l) = \left\{ {n \atop l} \right\} l! / l^{n} gives us the probability that all bins will be filled.
Intuitively, the larger l, the larger n must be to fill the bins. For instance, in this simplified analysis, the 34-bit representation of a 10-digit phone number p would need to be reported by at least 170 users for it to have about an 80% probability of being reconstructed at the server side. In reality, the additional noise and the error-correction encoding in the SH protocol further complicate the relationship between l and n. However, it is clear that reducing l also reduces the number of reports above which heavy hitters can be detected with high probability. This motivates our choice of bucketizing phone numbers by grouping them based on area codes, and by running a separate instance of the SH protocol per bucket, as only seven digits need to be reported by the SH protocol for each phone number in a bucket. Following the above analysis, 111 reports are sufficient to reconstruct 24-bit values (needed to represent 7-digit numbers) with 80% probability, which equates to about a 34.7% reduction in the number of reports to be received by the server.
As outlined in Section 4.1, Equation 4 can also be used by the server for deciding if the clients that have a report to be sent within a given bucket (i.e., if they need to report a caller ID within a given prefix) should actually send the report (using LDP) or not. Considering 24 bits per phone number, as above, and assuming all clients in the same bucket intend to report the same 7-digit phone number, all buckets receiving less than 84 reports can be easily ignored, because the server will have less than a 50% probability of correctly reconstructing a heavy hitter in those buckets. This probability is even lower in practice, since each bucket will likely receive reports about different phone numbers. Instructing clients that intend to send a report to "low density" buckets to stop doing so will prevent running the LDP protocol in vain. Thus, those clients can avoid wasting their privacy budget for those specific LDP protocol runs.
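To make the balls-and-bins estimates above reproducible, the following minimal Python sketch computes the probability U_{l,n}(l) that all l bins are filled, using the equivalent inclusion–exclusion form rather than Stirling numbers. The function name prob_all_bins_filled is hypothetical, and the printed values are approximate.

from math import comb

def prob_all_bins_filled(l, n):
    """Probability that n balls thrown uniformly at random into l bins
    leave no bin empty (i.e., U_{l,n}(l) from Equation (4))."""
    return sum((-1) ** i * comb(l, i) * ((l - i) / l) ** n for i in range(l + 1))

# Reproduce the figures quoted in the text (approximately):
print(round(prob_all_bins_filled(34, 170), 3))  # about 0.8 for 34-bit values and 170 reports
print(round(prob_all_bins_filled(24, 111), 3))  # about 0.8 for 24-bit values and 111 reports
print(round(prob_all_bins_filled(24, 84), 3))   # about 0.5 for 24-bit values and 84 reports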
Another benefit of grouping phone numbers by area code is that some spam campaigns tend to use numbers with specific area codes. Figure 8 visually shows this tendency. Figure 2 shows a more comprehensive view of how the relative frequency of phone numbers in the FTC data is amplified when bucketization is used. Specifically, each vertical line represents the frequency of caller IDs appearing in the FTC complaints dataset. The figure on the left shows the occurrence frequency of phone numbers relative to all complaints received in one day, whereas the figure on the right shows how their relative frequency changes after bucketization (notice the different y-axis scales of the two graphs). The take-away from this analysis is that bucketization results in the amplification of the relative frequency of some heavy hitter caller IDs and, hence, in the variance reduction of frequency estimates (see Equation 2), thus increasing the likelihood that heavy hitters will be correctly reconstructed and detected by the server.

D ANALYSIS OF EXTENDED RANDOMIZER
While the frequency estimate fˆ(v) of an item v ∈ V computed from n noisy reports generated using the basic randomizer (in line 10 of Algorithm 6) is unbiased, its variance is often quite large in practice, and this could lead to low accuracy in heavy hitter detection. Inspired by the antithetic variates technique in Monte Carlo methods [20], we extend the basic randomizer and introduce a new randomizer R_ext which yields lower variance. The extended randomizer is described in Algorithm 4.
The main difference between the randomizers is in the number of different values each user can report. Notice that z_r ∈ {c√m, 0, −c√m} in the extended randomizer, while z_r ∈ {c√m, −c√m} in the basic randomizer. The idea behind this modification is that the sum of contributions to the estimate fˆ(v) from users who do not hold item v is non-zero in practice, due to the variance, while in expectation these contributions should cancel out.
The following lemma shows that the extended randomizer provides an unbiased estimate of the (encoded) item x.

Lemma 1. Let p = e^ε/(e^ε + 2), q = θ = 1/(e^ε + 2), and c = (e^ε + 2)/(e^ε − 1). The extended randomizer R_ext has the following properties:
(i) For every x ∈ {−1/√m, 1/√m} ∪ {0}, E[R_ext(x)] = x.
(ii) R_ext satisfies ε-LDP for every r ∈ [m].

The proof of the above lemma is provided in Appendix D.1.
Given a set of noisy reports z_1, ..., z_n generated by the extended randomizer, the randomizer yields an unbiased estimate of frequency with smaller variance than the basic randomizer. The following lemma formalizes this discussion; its proof appears in Appendix D.1.

Lemma 2. Let v* ∈ V be an item and {w_i}_{i=1}^{n} be the noisy reports. The frequency estimate fˆ(v*) = (1/n) Σ_{j=1}^{n} w_j^⊤ c(v*) has the following properties:
(i) E[fˆ(v*)] = f(v*), and
(ii) Var(fˆ(v*)) = (1/n) { f(v*) · (c²(p + q) − 1) + (1 − f(v*)) · 2c²θ },
where f(v*) is the true frequency of v*.

Two important remarks are in order. First, the extended randomizer R_ext reduces to the basic randomizer R_bas if we set c = (e^ε + 1)/(e^ε − 1), p = e^ε/(e^ε + 1), q = 1/(e^ε + 1), and θ = 1/2. Second, the above shows that the variance of the frequency estimate of an item v* ∈ V can be written as a linear combination of two terms: c²(p + q) and 2c²θ.
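As a concrete illustration, here is a minimal Python sketch of a per-coordinate extended randomizer using the parameter setting from Lemma 1. It is a sketch under the assumptions above: the sampled coordinate r and the ±1/√m encoding are handled by the surrounding SH protocol, and the function name r_ext is hypothetical.

import math, random

def r_ext(x_r, eps, m):
    """Extended randomizer for one sampled coordinate.
    x_r is the r-th coordinate of the encoded item: +-1/sqrt(m), or 0 for the
    special 'no value' item. Returns a report in {c*sqrt(m), 0, -c*sqrt(m)}."""
    e = math.exp(eps)
    p, q, theta = e / (e + 2), 1 / (e + 2), 1 / (e + 2)
    c = (e + 2) / (e - 1)
    u = random.random()
    if x_r == 0:
        # special item: symmetric noise, so the expected report is 0
        if u < theta:
            return c * math.sqrt(m)
        if u < 2 * theta:
            return -c * math.sqrt(m)
        return 0.0
    # true item: report c*m*x_r with prob. p, its negation with prob. q, else 0
    if u < p:
        return c * m * x_r
    if u < p + q:
        return -c * m * x_r
    return 0.0

# Sanity check: E[report | r] = c*m*x_r*(p - q) = m*x_r, so that averaging the
# sparse report vector over the random coordinate r recovers x (Lemma 1(i)).
m, eps = 16, 4.0
x_r = 1 / math.sqrt(m)
est = sum(r_ext(x_r, eps, m) for _ in range(200_000)) / 200_000
print(round(est, 2), round(m * x_r, 2))  # both close to m * x_r = 4.0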
While we wish to find optimal parameter values for c, p, q, and θ that minimize the variance, this is not possible because f(v*) is unknown. Instead, we minimize the maximum of those two terms under the ε-LDP constraints:

\[
\begin{aligned}
\underset{c,\,p,\,q,\,\theta}{\text{minimize}}\quad & \max\{c^{2}(p+q),\; 2c^{2}\theta\}\\
\text{subject to}\quad & c\,(p-q) = 1\\
& p - e^{\varepsilon}\theta \le 0, \qquad -p + e^{-\varepsilon}\theta \le 0\\
& q - e^{\varepsilon}\theta \le 0, \qquad -q + e^{-\varepsilon}\theta \le 0\\
& p - e^{\varepsilon}q \le 0, \qquad -p + e^{-\varepsilon}q \le 0\\
& 1 - p - q - e^{\varepsilon}(1-2\theta) \le 0\\
& -1 + p + q + e^{-\varepsilon}(1-2\theta) \le 0\\
& 0 \le p + q \le 1, \qquad 0 \le \theta \le \tfrac{1}{2}.
\end{aligned}
\]

Solving the above optimization problem gives the following solution:

\[ p = \frac{e^{\varepsilon}}{e^{\varepsilon}+2}, \qquad q = \theta = \frac{1}{e^{\varepsilon}+2}, \qquad c = \frac{e^{\varepsilon}+2}{e^{\varepsilon}-1}. \tag{5} \]

Proposition 1. The frequency estimate fˆ(v) of an item v given by R_ext has lower variance than that given by R_bas if

\[ \varepsilon \ \ge\ \ln\!\left(\frac{a + \sqrt{9a^{2} - 20a + 12}}{2\,(1-a)}\right), \]

where a = f(v), i.e., the true frequency of v.

The proof of the above proposition is simple and given in Appendix D.1.

Theorem 1. Algorithm 3 satisfies (εHH + εOLH)-differential privacy.

The proof of Theorem 1 follows from [3, Theorem 3.4] and is included in Appendix D.1 for completeness.

D.1 Proofs for Extended Randomizer

Proof of Lemma 1. Consider an item v* ∈ V on a channel k ∈ [K] and a hash function H: V → [K]. For users j with H(v_j) = k, we have

\[
E[R_{\mathrm{ext}}(x_j)] = E[z_j] = E\big[E[z_j \mid r_j]\big]
= \frac{1}{m}\big(E[z_j[1]], \dots, E[z_j[m]]\big)^{\top}
= \frac{1}{m}\big(cm(p-q)\,x_j[1], \dots, cm(p-q)\,x_j[m]\big)^{\top}
= c\,(p-q)\,x_j .
\]

Since c(p − q) = 1, we have E[R_ext(x_j)] = x_j. For those users with H(v_j) ≠ k, their encoded item is x_j = Enc(v_j) = 0, and we have

\[
E[z_j] = \frac{1}{m}\big(c\sqrt{m}\,\theta - c\sqrt{m}\,\theta, \ \dots,\ c\sqrt{m}\,\theta - c\sqrt{m}\,\theta\big)^{\top} = 0 = x_j .
\]

This completes the proof of the unbiasedness of R_ext.
Next, we prove ε-LDP of the extended randomizer. Let v_1 and v_2 be two arbitrary items in V and let x_1 and x_2 be their encodings in {−1/√m, 1/√m} ∪ {0}, respectively. For any z_r ∈ {cmx_r, 0, −cmx_r}, we have

\[ \frac{\Pr[z_r \mid x_1, r]}{\Pr[z_r \mid x_2, r]} \ \le\ \max\left\{\frac{p}{\theta},\ \frac{1-2\theta}{1-p-q}\right\} = e^{\varepsilon}. \]

Similarly,

\[ \frac{\Pr[z_r \mid x_1, r]}{\Pr[z_r \mid x_2, r]} \ \ge\ \min\left\{\frac{1-p-q}{\theta},\ \frac{\theta}{p}\right\} = e^{-\varepsilon}. \]
□

Proof of Lemma 2. Let x* = c(v*). We first prove the unbiasedness property. Since w_j is an unbiased estimate of x_j (i.e., E[w_j] = x_j), it is easy to see that fˆ(v*) is also unbiased:

\[
E[\hat f(v^*)] = E\Big[\frac{1}{n}\sum_{j=1}^{n} w_j^{\top} c(v^*)\Big]
= \frac{1}{n}\Big\{\sum_{j: v_j = v^*} E[w_j^{\top} x^*] + \sum_{j: v_j \ne v^*} E[w_j^{\top} x^*]\Big\}
= \frac{1}{n}\sum_{j: v_j = v^*} \lVert x_j \rVert^{2}
= \frac{\sum_{j: v_j = v^*} 1}{n} = f(v^*).
\]

To compute the variance Var(fˆ(v*)), we condition on the random bits chosen by the users. Let r = (r_1, ..., r_j, ..., r_n) be a vector where r_j ∈ [m] represents the random bit chosen by user j. By the law of total variance,

\[
\operatorname{Var}(\hat f(v^*)) = E\big[\operatorname{Var}(\hat f(v^*) \mid r)\big] + \operatorname{Var}\big(E[\hat f(v^*) \mid r]\big)
= \frac{1}{n^{2}}\,E[A] + \frac{1}{n^{2}}\operatorname{Var}(B). \tag{6}
\]

The first term is

\[
A = \sum_{j=1}^{n} \operatorname{Var}(w[r_j] \mid r_j)\, x^*[r_j]^{2}
= \sum_{j: v_j = v^*} \big(c^{2}m^{2}x[r_j]^{2}(p+q) - m^{2}x[r_j]^{2}\big)\, x^*[r_j]^{2}
+ \sum_{j: v_j \ne v^*} 2c^{2}m\theta \, x^*[r_j]^{2},
\]

and, taking the expectation over the uniformly random positions r_j and using x[i]^{2} = 1/m,

\[
E[A] = n f(v^*)\,\{c^{2}(p+q) - 1\} + n\,(1 - f(v^*))\cdot 2c^{2}\theta. \tag{7}
\]

The second term is

\[
B = \sum_{j=1}^{n} E\big[w[r_j]\, c(v^*)[r_j] \mid r_j\big]
= \sum_{j: v_j = v^*} E\big[w[r_j]\big]\, x^*[r_j] + \sum_{j: v_j \ne v^*} E\big[w[r_j]\big]\, x^*[r_j]
= \sum_{j: v_j = v^*} cm\,x[r_j]^{2}(p-q)
= n f(v^*)\, cm\, x[r_j]^{2}(p-q),
\]

and, since x[r_j]^{2} = 1/m is constant,

\[
\operatorname{Var}(B) = n^{2} f(v^*)^{2}\, c^{2}m^{2}(p-q)^{2}\, \operatorname{Var}\big(x[r_j]^{2}\big) = 0. \tag{8}
\]

Plugging (7) and (8) into (6) gives the claimed result. □

Proof of Proposition 1. Using the parameters in (5), we get the variance of the frequency estimate fˆ(v) given by the extended randomizer:

\[ \operatorname{Var}(\hat f(v)) = \frac{1}{n}\left[a\left(\frac{(e^{\varepsilon}+1)(e^{\varepsilon}+2)}{(e^{\varepsilon}-1)^{2}} - 1\right) + (1-a)\,\frac{2(e^{\varepsilon}+2)}{(e^{\varepsilon}-1)^{2}}\right], \tag{9} \]

where a = f(v) is the true frequency of v, whereas for the basic randomizer

\[ \operatorname{Var}(\hat f(v)) = \frac{1}{n}\left[\left(\frac{e^{\varepsilon}+1}{e^{\varepsilon}-1}\right)^{2} - a\right]. \tag{10} \]

Requiring (9) to be no larger than (10), simplifying, and rearranging the terms, the above inequality reduces to

\[ (1-a)\,e^{2\varepsilon} - a\,e^{\varepsilon} - (3-2a) \ \ge\ 0. \tag{11} \]

The l.h.s. of (11) is a simple quadratic function of e^{\varepsilon} with zeros at

\[ e^{\varepsilon} = \frac{a \pm \sqrt{9a^{2} - 20a + 12}}{2\,(1-a)} . \]

Thus, the inequality (11) is satisfied when ε ≥ ln((a + √(9a² − 20a + 12)) / (2(1 − a))). This completes the proof. □

Proof of Theorem 1. Fix a user j and two items v_j, v_j′ ∈ V held by j. Observe that, in Algorithm 2, for any fixed sequence H of hash functions, each user j makes a report to KT channels, and each report is generated independently. Among the K channels, there exists only one channel on which user j sends the noisy report of her true item v_j; on the remaining K − 1 channels, user j sends the noisy report of the special item 0. Thus, changing the user's item from v_j to v_j′ changes the distribution of her reports on at most two channels in each of the T rounds, and on each such channel the ratio of the two output distributions is bounded by the LDP guarantee of the extended randomizer. Since the user's reports over separate channels are independent, the corresponding ratio over all KT channels used for heavy hitter detection is bounded by exp(εHH). For frequency estimation, the user generates one additional report using R_OLH(v_j, εOLH), which satisfies εOLH-LDP, and sends it to the server. Again, by the independence of the heavy hitter detection and frequency oracle reports, the ratio of the distributions of the user's overall output is bounded by exp(εHH) · exp(εOLH) = exp(εHH + εOLH). This completes the proof. □

E DATASET PROPERTIES
Figure 8 shows the relative frequency of phone numbers that make more than one hundred calls in a day, compared to the total number of calls made by all phone numbers reported within the same area code. These graphs are computed based on phone numbers extracted from call reports from US residents to the FTC about unwanted calls (more details about the FTC data we use are provided in Section 5). The figure is related to a sample day worth of reports. Each vertical bar indicates a different area code prefix, and the striped portion of the bars indicates the relative fraction of complaints related to numbers that were complained about more than one hundred times in a day. As can be seen, phone numbers with more than one hundred complaints appear in only a limited number of prefixes, and their relative occurrence frequency is high in their respective area codes, whereas it would be diluted if we considered all 10-digit numbers in just one bucket.
Figure 9 depicts the number of valid reports received each day, showing a weekly pattern in which a much lower number of complaints is received around the weekends.
Figure 10 shows the distribution of the number of daily complaints per caller ID throughout the entire period of observation included in the dataset: the x-axis lists the number of complaints, and the y-axis shows how many phone numbers received x complaints in a single day. It is easy to see that the vast majority of phone numbers received a single daily complaint, but there also exist phone numbers that received hundreds of complaints in a single day.

[Figure 8: Telephone number distribution related to phone numbers that received more than 100 complaints in a sample day. Striped bars mark possible spammers; y-axis: telephone number distribution [%]; x-axis: area code prefix.]
[Figure 9: Number of daily complaints received between Feb. 17th and Mar. 17th; y-axis: reports; x-axis: day.]
[Figure 10: Distribution of daily complaints per caller ID; y-axis: telephone numbers (log scale); x-axis: no. of complaints.]
