
PDA: Semantically Secure Time-Series Data Analytics with Dynamic Subgroups

Taeho Jung¹, Junze Han¹, Xiang-Yang Li¹,²
¹Department of Computer Science, Illinois Institute of Technology, Chicago, IL
²Department of Computer Science and Technology, TNLIST, Tsinghua University, Beijing
*Dr. Xiang-Yang Li ([email protected]) is the contact author of this paper.

Abstract—Third-party analysis on private records is becoming increasingly important due to the widespread collection of data for various analysis purposes. However, the data in its original form often contains sensitive information about individuals, and its publication will severely breach their privacy. In this paper, we propose a novel Privacy-preserving Data Analytics framework, PDA, which allows a third-party aggregator to obliviously conduct many different types of polynomial-based analysis on private data records provided by a dynamic subgroup of users. Notably, every user needs to keep only O(n) keys to join data analysis among O(2^n) different groups of users, and any data analysis that can be represented by polynomials is supported by our framework. Besides, a real implementation shows that the performance of our framework is comparable to that of peer works that present ad-hoc solutions for specific data analysis applications. In addition to these properties, PDA is provably secure against a powerful attacker (chosen-plaintext attack) even in the Dolev-Yao network model, where all communication channels are insecure.

Index Terms—Privacy-preserving Computation, Multi-party Computation, Secure Data Analysis

1 INTRODUCTION

A great amount of data is collected for information-based decision making in our real life these days (Smart Grid [25], Social Network Services [19], Location Based Services [56], etc.). This trend originated from business needs. Enterprises have been increasingly looking to discover the answers to specific business questions from the data they possess, and various data mining techniques are applied to derive the answers (e.g., marketing [8] and health care [39]). Driven by such needs, publication, sharing, and analysis of user-generated data has become common. For example, the President's Information Technology Advisory Committee (PITAC) released a report proposing a nation-wide electronic medical records sharing system to share medical knowledge for various clinical decision making [20]; AOL published its anonymized query logs for research purposes [5]; Netflix released an anonymized dataset containing 500,000 subscribers' movie ratings to organize a public contest for the best recommendation algorithm [7]. However, user-generated data in its original form contains much sensitive information about individuals ([2], [24]), and the publication of such data has brought active discussions in both industry and academia [15], [18], [45]. Current primary practice is to rely on 1) sanitization to clean up the sensitive information in the publication [33], or 2) regulation enforcement to restrict the usage of the published data [34], [37]. The former approach may distort the data during the sanitization, which leads to accuracy loss, and the latter relies solely on legal or voluntary agreements that the data should be used only for agreed purposes. The drawback of those protections is twofold. 1) A good trade-off between the utility and the privacy is hard to find; 2) even if data recipients do not misbehave on the data, such protection does not guarantee privacy, in the sense that profitable sensitive data may become the target of various types of attacks, and contracts or agreements cannot guarantee any protection in this case.

Various previous works focus on data anonymization [46], [50] to protect the identities of the data owners, but de-anonymization [26], [29] is its powerful countermeasure, and it cannot provably guarantee individual privacy.

In this paper, we focus on a more substantial aspect of the privacy implications in third-party data analysis – the confidentiality of the data itself. We investigate the privacy-preserving data analysis problem in which a third-party aggregator wants to perform data analysis on a private dataset generated by any subgroup of the total users in the system, where the total users in the system may change dynamically. To enable this without privacy concerns, it is desirable to have a data analysis tool which can provably guarantee the data privacy without bringing distortion to the analysis result, such that the aggregator gets the most utility out of it. Various types of solutions based on different techniques have been proposed to support such privacy-preserving computation (secure multi-party computation, functional encryption, perturbation, etc., all reviewed in Section 8), but we have the following design goals that make our scheme applicable in a realistic environment, and most existing solutions fail to achieve one or more of these goals.

Design Goals

Varieties: Two varieties need to be considered. In reality, the actual group of users generating data changes rapidly along the time, and the type of analysis itself also changes depending on the characteristics of the dataset. Therefore, a scheme needs to adapt itself to these two varieties.

Time-series Data: Users continuously generate data, therefore the data analysis scheme should preserve the data confidentiality even when the data is a time series and may be substantially self-correlated.

Communication & Computation Overhead: Considering the rate at which data is generated and collected these days, the analysis on the published data should be efficient in terms of both computation and communication overhead.

Channel Security: In dynamic environments such as VANETs, establishing pair-wise secure channels is almost impossible due to the huge overhead. Therefore a scheme cannot rely on secure channels to achieve privacy.

Privacy-requirement Independent Accuracy: A distorted result is not accepted in many cases, therefore we require an accurate result with zero or negligible distortion from the published data, which is not a function of the guaranteed privacy level. This property enables us to guarantee the proposed security (IND-CPA) at all times, hence a trade-off between security and utility is unnecessary.
In this paper, we design a framework for Privacy-preserving Data Analytics (PDA), a cryptographic tool which enables third-party analysis on private data records without privacy leakage. The main contributions of this paper are summarized as follows.

1) Instead of presenting ad-hoc solutions for many applications, PDA enables any type of data analysis that is based on polynomials (e.g., statistics calculation [55], regression analysis, and support vector machine classification) to be performed on a private dataset distributed among individuals.

2) We propose a provably secure cryptographic framework which guarantees semantic security against chosen-plaintext attack. Specifically, any polynomial adversary's advantage is polynomially bounded by the advantage of solving a Decisional Diffie-Hellman problem.

3) Although we support many different formats of polynomial-based data analysis among any subset of n users, every user needs to hold only O(n) keys, and the computation & communication complexities for adding an extra user to an existing group of n users are both O(n), while removing an existing user does not require any extra operation. Besides, on average the communication overhead for an ordinary user is less than 1 KB per product term in the polynomial.

2 PRELIMINARIES

2.1 Notations

We use Z_N to denote the additive group {0, 1, ..., N-1}, and Z_N^* to denote the invertible residues modulo N. Also, |G| and |P| are used to denote the order of a group G and the size of a set P respectively. Recall that, given the conventional notation of Euler's totient function φ(N), |Z_N^*| = φ(N). When demonstrating the reducibility between computation problems X, Y, we use X ≤_P Y to denote that X is polynomially reducible to Y, and X ≡ Y to denote that X is polynomially equivalent to Y.

Definition 1. The k-th Decisional Diffie-Hellman (k-DDH) problem in the group Z_N^* with generator g is to decide whether (g^k)^c = (g^k)^{ab} given a triple ((g^k)^a, (g^k)^b, (g^k)^c), and any algorithm A's advantage in solving the k-DDH problem is denoted by adv_A^{k-DDH}.

Definition 2. A number w is called an N-th residue modulo N² if there exists y ∈ Z_{N²}^* such that w = y^N mod N².

Recall that the set of N-th residues modulo N² is a cyclic multiplicative subgroup of Z_{N²}^* of order φ(N). Every N-th residue has N roots, and specifically, the set of all N-th roots of the identity element forms a cyclic multiplicative group {(1+N)^x = 1 + xN mod N² | x ∈ Z_N} of order N.

Lemma 1. There exists a function

D : {(1+N)^x mod N² | x ∈ Z_N} → Z_N

which can efficiently solve the discrete logarithm within the subgroup of Z_{N²}^* generated by g = 1+N.

Proof. For any y = (1+N)^x, y = (1+xN) mod N². Then, the function D(y) = (y-1)/N = x efficiently solves the discrete logarithm within the subgroup, where '(y-1)/N' is the quotient of the integer division (y-1)/N rather than y-1 times the inverse of N.

Definition 3. The Decisional Composite Residuosity (DCR) problem in the group of N-th residues modulo N² with generator g is to decide whether a given element x ∈ NR_{N²} is an N-th residue modulo N².
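As a concrete illustration of Lemma 1, the following minimal Python sketch (with a toy modulus of our choosing; the scheme itself uses a large RSA modulus) recovers the discrete logarithm in ⟨1+N⟩ with a single integer division:

```python
# Toy demonstration of Lemma 1: discrete logarithm in the subgroup <1+N> of Z*_{N^2}.
# N is tiny here for illustration only; the scheme uses a >= 1024-bit RSA modulus.
N = 101 * 113          # a small semiprime standing in for the RSA modulus
NN = N * N             # the modulus N^2

def D(y: int) -> int:
    """Discrete log solver for {(1+N)^x mod N^2}: D((1+N)^x) = x."""
    return (y - 1) // N  # integer quotient, NOT (y-1) times N^{-1} mod N^2

for x in [0, 1, 42, N - 1]:
    y = pow(1 + N, x, NN)          # (1+N)^x = 1 + xN (mod N^2)
    assert y == (1 + x * N) % NN   # the binomial expansion collapses mod N^2
    assert D(y) == x               # the discrete log is recovered for free
print("Lemma 1 verified for sample exponents")
```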
2.2 Computational Assumptions

N is an RSA modulus of two safe prime numbers p, q (N = pq, and p, q satisfy p = 2p' + 1, q = 2q' + 1 for some large prime numbers p', q' [14]). Then, the DDH assumption states that if the factorization of N is unknown, the DDH problem in Z_N^* is intractable [10], [35]. That is, for any probabilistic polynomial time algorithm (PPTA) A, adv_A^{DDH} is a negligible function of the security parameter κ.

Lemma 2. When N is a semiprime of two safe prime numbers p, q and the factorization N = pq remains unknown, we have DDH ≤_P k-DDH.

Proof. Given an oracle that solves the k-DDH problem, if one knows k, he can submit ((g^a)^k, (g^b)^k, (g^c)^k) to this oracle. g^c = g^{ab} if and only if the oracle outputs 1.

Under the same condition, the DCR assumption states that the DCR problem in Z_{N²}^* is believed to be intractable if the factorization of N = pq is unknown [52]. That is, for any PPTA A, adv_A^{DCR} is a negligible function of the security parameter. In this paper, we rely on the above assumptions to define and prove the security.

3 PROBLEM DEFINITIONS

3.1 Problem Modelling

W.l.o.g., we assume there are in total n users, and each of them is represented by an integer ID i ∈ {1, ..., n}. Each user generates time-series data records x_{i,t} at time t, but we use x_i to denote his data value for visual comfort. A third-party aggregator wishes to choose an arbitrary subset of users P ⊆ {1, ..., n} to conduct certain data analysis on their private data records by evaluating:

f(x_P) = \sum_k ( c_k \prod_{i \in P} x_{ik}^{e_{ik}} )    (1)

where x_{ik} is user i's private value for the k-th product term \prod_i x_{ik}^{e_{ik}}, and we call P the participants of the polynomial evaluation. Note that Eq. (1) can be used to represent any general multivariate polynomial because the coefficients c_k and the exponents e_{ik} are tunable parameters.

To model the time variation, we use a discrete time domain T = {t_1, t_2, ...} of time slots, and when the aggregator submits the query polynomial to users, he will also declare a sequential time window T_f ⊂ T associated with the polynomial in order to request users' data generated during the time window T_f.
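As a worked example of the generality of Eq. (1) (our illustration, anticipating the statistics examples of Section 4.5), the mean and the variance over P instantiate its coefficients c_k and exponents e_{ik} as follows:

```latex
% Mean: one single-user product term per user i, with coefficient 1/|P|.
\bar{x} \;=\; \sum_{i\in\mathcal{P}} \tfrac{1}{|\mathcal{P}|}\, x_i^{\,1}
% Variance: one degree-2 term per user plus |P|^2 pairwise terms x_i x_j,
% all of the shape  c_k \prod_{i\in\mathcal{P}} x_{ik}^{e_{ik}}  of Eq. (1):
\mathrm{Var}(x) \;=\; \sum_{i\in\mathcal{P}} \tfrac{1}{|\mathcal{P}|}\, x_i^{2}
\;-\; \sum_{i,j\in\mathcal{P}} \tfrac{1}{|\mathcal{P}|^{2}}\, x_i\, x_j
```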
3.2 Informal Threat Assumptions

Due to privacy concerns, the analytic result should be given only to the aggregator, and any user's data should be kept secret from everyone but its owner unless it is deducible from the polynomial value. Besides, both users and the aggregator are assumed to be passive and adaptive adversaries. They are passive in that they will not tamper with the correct computation of the protocol. If users tampered with the protocol, it is highly likely that the aggregator would detect it, since the outcome of the protocol would not be in a valid range (e.g., mean age = 3,671.9) due to the cryptographic operations on large integers; the aggregator is interested in the correct analytic result, so they will not tamper with the protocol itself. Furthermore, users may report a value with a small deviation such that the analytic result still 'looks' good, for many reasons (moral hazards, unintentional mistakes, etc.), but evaluating the reliability of the reported values is out of this paper's scope. On the other hand, they are adaptive in that they may produce their public values adaptively after seeing others' public values (i.e., passive rushing attack [30]). Besides, we also assume the users do not collude with each other or with the aggregator in our basic construction; this assumption is relaxed in the advanced version in Section 7.3.

As aforementioned, we assume the Dolev-Yao network model. That is, all communication channels are open to anyone, and anyone can overhear and synthesize any message. This is a reasonable assumption since, in a dynamic network environment in which the actual group of available users may change (e.g., sensor networks, crowdsourcing environments), establishing pairwise secure communication channels is costly and may not be possible due to limited resources.

3.3 Privacy-preserving Data Analytics: PDA

To formally define the correctness and the security of our framework, we first present a precise definition of PDA.

Definition 4. Our Privacy-preserving Data Analytics (PDA) is the collection of the following four polynomial time algorithms: Setup, KeyGen, Encode, and Aggregate.

Setup(1^κ) → params is a probabilistic setup algorithm that is run by a crypto server to generate system-wide public parameters params given a security parameter κ as input.

KeyGen(params) →(dist.) {EK_1, EK_2, ..., EK_n} is a probabilistic and distributed algorithm jointly run by the users. Each user i will secretly receive his own secret EK_i.

Encode(x_{ik}, f(·), EK_i, T_f) → C(x_{ik}) is a deterministic algorithm run by each user i to encode his private value x_{ik} into a communication string C(x_{ik}) using t_k ∈ T_f. The output C(x_{ik}) is published on an insecure channel.

Aggregate({C(x_{ik}) | ∀k, ∀i ∈ P}) → f(x_P) = \sum_k (c_k \prod_i x_{ik}^{e_{ik}}) is run by the aggregator to aggregate all the encoded private values {C(x_{ik})} to calculate f(x_P).

We use 'Encode' and 'Aggregate' instead of the conventional terminologies 'Encrypt' and 'Decrypt' because we will employ a homomorphic encryption as a black-box algorithm later.

Definition 5. PDA is correct if for all P ⊆ {1, ..., n}:

Pr[ params ← Setup(1^κ); {EK_i}_{i=1}^n ← KeyGen(params); ∀k, ∀i : C(x_{ik}) ← Encode(x_{ik}, f(x_P), EK_i, T_f) : Aggregate({C(x_{ik}) | ∀k, ∀i}) = f(x_P) ] = 1

where the probabilities are taken over all random bits consumed by the probabilistic algorithms.

The security of PDA is formally defined via the following data publishing game.

Data Analysis Game:

Setup: 3 disjoint time domains are chosen: T_1 for phase 1, T_2 for phase 2, and T_c for the challenge phase.

Init: The adversary declares his role in the PDA scheme (i.e., aggregator or user), and the challenger controls the remaining participants. Both of them engage in the Setup and KeyGen algorithms.

Phase 1 in T_1: The adversary submits polynomially many calls to the encode oracle¹ based on arbitrary queries (f(x), T_f ⊆ T_1, {x_{ik}}_{i∈P,k}). If the declared time windows do not overlap with each other, he receives all corresponding communication strings from the oracle. Otherwise, he receives nothing.

Challenge in T_c: The adversary declares the target polynomial f*(x_P) to be challenged as well as its time window T_{f*} ⊆ T_c. Then, he submits two sets of values x_{P,1}, x_{P,2}, such that f*(x_{P,1}) = f*(x_{P,2}), to the challenger. The challenger flips a fair binary coin b and generates the corresponding communication strings based on x_{P,b}, which are given to the adversary.

Phase 2 in T_2: Phase 1 is repeated adaptively, but the time window T_f should be chosen from T_2.

Guess: The adversary gives a guess b' on b.

The advantage of an algorithm A in this game is defined as adv_A^{PDA} = |Pr[b' = b] - 1/2|. Intuitively, we assume the worst-case scenario (i.e., the most powerful attacker) in the above game, where even an adversarial user can specify the target polynomial to be evaluated, and the polynomial can be chosen adaptively to increase the advantage. In real-life applications, polynomials are chosen by the aggregator only, and the choice will be monitored by the users since the coefficients and powers are public.

1. The oracle generates and gives the communication strings at the adversary's request based on the queried polynomial f(x_P), its associated time window T_f, and the input values {x_{ik}}_{i∈P,k}. The challenger can act as such an encode oracle in this game for the users that are not under the adversary's control.
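For orientation, the interface below restates Definition 4 as a Python skeleton; the names and type hints are our own illustration rather than part of the definition (the concrete instantiations appear in Section 4.2):

```python
# Illustrative interface for the four PDA algorithms of Definition 4.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Params:
    """System-wide public parameters produced by Setup(1^kappa); H(.) omitted."""
    g: int
    g_tilde: int
    N: int
    N_tilde: int

def setup(kappa: int) -> Params:
    """Run once by the crypto server; publishes params, discards factorizations."""
    raise NotImplementedError

def keygen(params: Params, n: int) -> Dict[int, List[int]]:
    """Distributed among the n users; user i secretly obtains EK_i = (q^(2)(i), ..., q^(n-1)(i))."""
    raise NotImplementedError

def encode(x_ik: int, f_spec, ek_i: List[int], window: List[int]) -> int:
    """Deterministic; user i publishes the communication string C(x_ik) on an open channel."""
    raise NotImplementedError

def aggregate(strings: Dict[Tuple[int, int], int]) -> int:
    """Run by the aggregator; recovers f(x_P) and nothing more."""
    raise NotImplementedError
```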
Definition 6. Our PDA is indistinguishable against chosen-plaintext attack (IND-CPA) if every polynomial time adversary's advantage in the above game is a negligible function w.r.t. the security parameter κ when T_1, T_2, and T_c are three disjoint time domains.

Note that our security definition does not guarantee that the private values cannot be deduced from the output values. Even a perfectly secure black-box algorithm cannot guarantee this if the same set of k values is fed as input to k different polynomials' evaluations (a simple linear system). Rather, we use IND-CPA to define the security, which is shown to be equivalent to semantic security [36]. In other words, our framework guarantees that any adversary's knowledge (either an adversarial aggregator or user) of the probability distribution of the private values remains unchanged even if he has access to all the communication strings generated by our PDA. The indistinguishability becomes particularly crucial when the data is generated continuously along the time, and the same user's data may be temporally correlated. Unless guaranteed by semantic security, the data privacy is hardly preservable.
4 ACHIEVING CORRECT & SECURE PDA

In this section, we first present a basic construction without colluding users for simplicity, and further present the countermeasure to collusion attacks in Section 7.3.

4.1 Intuitions

The following two steps sketch our design of the oblivious polynomial evaluation.

Hiding each value: To hide every private value x_{ik} ∈ Z_N in the polynomial f(x_P) (Eq. (1)), we let each user i publish the perturbed data x_{ik} r_{ik}, where r_{ik} is an independently chosen random noise which satisfies \prod_{i∈P} r_{ik} = 1 under the modulo operation, generated using our novel technique. It follows that the product of all perturbed data is equal to \prod_i x_{ik} for all k.

Hiding each product term: To hide the product terms in Eq. (1), we employ two users with special roles. The first special user i_1 publishes the homomorphically encrypted perturbed data E(x_{i_1 k} r_{i_1 k}) instead of his perturbed data, using the aggregator's key. Since the plaintext of this ciphertext is perturbed data, the aggregator does not learn x_{i_1 k}. On the other hand, because only the aggregator has the private key of the ciphertext, no one else learns x_{i_1 k} or \prod_i x_{ik} either². The second user i_2 does not publish his perturbed data either. Instead, he conducts homomorphic operations:

E(x_{i_1 k} r_{i_1 k})^{\prod_{i≠i_1} x_{ik} r_{ik}} = E(\prod_i x_{ik})

for all k, where E(m) denotes the ciphertext of m encrypted with a homomorphic encryption scheme. Then, he locally picks random numbers K_k such that \prod_k K_k = 1 (under the modulo operation), and publishes the perturbed ciphertexts K_k E(\prod_i x_{ik}) to the aggregator. The product of these is equal to E(\sum_k \prod_i x_{ik}), and only the aggregator can decrypt it because other users do not have the private key. On the other hand, the aggregator cannot obtain the value of any single product term because the ciphertexts are perturbed.

Letting the users independently obtain random noises r_{ik} such that \prod_{i∈P} r_{ik} = 1 is the most challenging part. To do so, we exploit the sequential time window T_f associated with the polynomial f(·). All users pick the k-th time slot t_k in T_f to generate r_{ik} for x_{ik}. Since the random noise is based on t_k, we need different time slots for all product terms. This implies we have to consume a T_f of size m for a polynomial f with m product terms. Also, all sequential time windows for the evaluations must be disjoint from each other to guarantee that no time slot is used twice.

2. Not all x_{ik} r_{ik} terms are published, therefore the product of the published perturbed data does not yield \prod_i x_{ik}.

4.2 Constructions

Firstly, we define the Lagrange coefficient for i ∈ Z_n and a set, P, of integers in Z_n as L_{i,P}(x) = \prod_{j∈P, j≠i} (x-j)/(i-j), and use the simplified Lagrange coefficient L_{i,P} for the special case L_{i,P}(0) for visual comfort.

Then, our provably secure framework PDA works as follows. Note that modular operations are omitted unless necessary, for the sake of simplicity.

Setup

A crypto server picks two safe semiprimes N = pq, Ñ = p̃q̃ such that p, q are of the same length, p̃, q̃ are of the same length of κ bits, and φ(N)/Ñ = k is an integer. Then, he randomly picks a generator g of the group Z_N^*, a generator g̃ of Z_Ñ^*, and sets h = g^k. Subsequently, he randomly chooses a hash function H : Z → ⟨h⟩, which is modelled as a random oracle. Finally, he publishes params = (g, g̃, N, Ñ, H) and destroys p, q, p̃, q̃ such that they are no longer available.

Lemma 3. h^Ñ = 1 mod N.

Proof. h^Ñ = g^{kÑ} = g^{φ(N)} = 1 mod N.

Any party independent from the aggregator and the users is qualified to act as the crypto server, for example, Internet Service Providers or governments.

Key Generation

Aggregator's Key Generation: The aggregator generates a pair of additive homomorphic encryption keys PK, SK, and publishes PK only. This additive homomorphic encryption must have the following properties:

E_PK(m_1) · E_PK(m_2) = E_PK(m_1 + m_2)
E_PK(m_1)^{m_2} = E_PK(m_1 m_2)

where E_PK(m) denotes the ciphertext of m encrypted with PK. Any additive homomorphic encryption with message space Z_N is accepted (e.g., Paillier's [52]).

Users' Key Generation: All users participate in the following secret sharing.

1) Every user i (i ∈ {1, ..., n}) randomly picks r_i ∈ Z_Ñ^*, and broadcasts y_i = g̃^{r_i} ∈ Z_Ñ^* via the insecure channel.

2) Then, every user i uses y_{i+1} and y_{i-1} to locally calculate Y_i = (y_{i+1} · y_{i-1}^{-1})^{r_i} ∈ Z_Ñ^*, where the inverse is over modulo Ñ. In particular, user 1 uses y_2 and y_n, and user n uses y_1 and y_{n-1}.

Lemma 4. \prod_{i=1}^{n} Y_i = 1 mod Ñ, and \prod_{i=1}^{n} Y_i^{Ñ} = 1 mod Ñ².

Proof. \prod_{i=1}^{n} Y_i = g̃^{(r_2-r_n)r_1} ··· g̃^{(r_1-r_{n-1})r_n} = g̃^0 = 1 mod Ñ. Then, \prod_{i=1}^{n} Y_i = 1 + mÑ for some integer m, and \prod_{i=1}^{n} Y_i^{Ñ} = (1+mÑ)^{Ñ} = 1 mod Ñ² as in Section 2.1.
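A self-contained toy simulation of steps 1-2 (Python; the tiny semiprime Ñ = 11·23 and the generator are our arbitrary stand-ins for the κ-bit parameters) confirms both identities of Lemma 4:

```python
# Toy simulation of the users' key generation (steps 1-2) and Lemma 4.
import random

Nt = 11 * 23             # N-tilde, a small semiprime; real deployments use kappa bits
g = 2                    # stand-in for the generator g~ of a subgroup of Z*_Nt
n = 5                    # number of users in the ring

r = [None] + [random.randrange(1, Nt) for _ in range(n)]   # secrets r_1..r_n (1-indexed)
y = [None] + [pow(g, r[i], Nt) for i in range(1, n + 1)]   # broadcast y_i = g~^{r_i}

def Y(i):
    nxt, prv = i % n + 1, (i - 2) % n + 1                  # ring neighbors i+1, i-1
    y_prev_inv = pow(y[prv], -1, Nt)                       # inverse modulo N-tilde
    return pow(y[nxt] * y_prev_inv % Nt, r[i], Nt)         # Y_i = (y_{i+1} y_{i-1}^{-1})^{r_i}

Ys = [Y(i) for i in range(1, n + 1)]
prod1 = 1
for v in Ys:
    prod1 = prod1 * v % Nt
assert prod1 == 1                                          # Lemma 4, first identity

prod2 = 1
for v in Ys:
    prod2 = prod2 * pow(v, Nt, Nt * Nt) % (Nt * Nt)
assert prod2 == 1                                          # Lemma 4, second identity
print("Lemma 4 holds:", Ys)
```

The exponents (r_{i+1} - r_{i-1}) r_i telescope around the ring, which is exactly why the product collapses to g̃^0.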
Subsequently, every user j randomly defines a polynomial q_j^{(d)}(x) over Z_Ñ^*, whose degree is d and whose constant term is 0 (i.e., q_j^{(d)}(0) = 0). That is, every user randomly picks the coefficients of his own polynomial from Z_Ñ^*. This is repeated for all d = 2, ..., n-1. Then, each user i executes the encoding key query (Sub-Algorithm 1), in which all other users jointly interact with him to respond to the query. The query outputs the specific data points on the secret global polynomials q^{(2)}(x), ..., q^{(n-1)}(x) for each user, where a global polynomial is the sum of all the individual polynomials.

Sub-Algorithm 1 EK_i Query

1: Every user j ≠ i computes and publishes

Q_{ij} = Y_j^{Ñ} (1+Ñ)^{q_j^{(d)}(i)} mod Ñ²

2: After seeing all the broadcasts, user i calculates:

Y_i^{Ñ} (1+Ñ)^{q_i^{(d)}(i)} \prod_{j≠i} Q_{ij} = (1+Ñ)^{\sum_{j=1}^{n} q_j^{(d)}(i)} mod Ñ²

Subsequently, he uses the aforementioned discrete logarithm solver D(·) to solve q^{(d)}(i) = \sum_j q_j^{(d)}(i) mod Ñ from it, which is set as one of his encoding keys. Later, q^{(d)}(i) will be used to generate the data masker when d+1 users' data is to be analyzed.
Steps 1 & 2 are repeated for d = 2, ..., n-1.
Output: EK_i = (q^{(2)}(i), ..., q^{(n-1)}(i))

Essentially, in the users' key generation, each user j randomly generates polynomials q_j^{(2)}(x), ..., q_j^{(n-1)}(x) to respond to others' key queries, and each key recipient i receives one data point of each secret polynomial, q^{(d)}(i), where the polynomial q^{(d)}(x) remains unknown to everyone because it is merely the sum of all users' random polynomials. However, a specific point on this unknown polynomial can be jointly computed as shown in the query.

Lemma 5. For any set of users P, we have:

\sum_{i∈P} q^{(|P|-1)}(i) L_{i,P} = 0 mod Ñ

Proof. Recall that the constant terms of all polynomials are 0. Then, due to the polynomial interpolation, the above sum equals q^{(|P|-1)}(0) = 0 mod Ñ.
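The next sketch runs Sub-Algorithm 1 end-to-end on the same kind of toy parameters and then checks Lemma 5 on the derived keys; the blinding factors Y_j^Ñ disappear from the product exactly as Lemma 4 predicts (all constants below are our illustrative choices):

```python
# Toy run of Sub-Algorithm 1 followed by a numerical check of Lemma 5.
import random
from math import prod

Nt = 11 * 23; NN = Nt * Nt
n = 5; d = n - 1                        # degree used when all n users participate
g = 2
r = [None] + [random.randrange(1, Nt) for _ in range(n)]
y = [None] + [pow(g, ri, Nt) for ri in r[1:]]
Y = [None] + [pow(y[i % n + 1] * pow(y[(i - 2) % n + 1], -1, Nt) % Nt, r[i], Nt)
              for i in range(1, n + 1)]

# Each user j picks a random degree-d polynomial with zero constant term.
coef = {j: [0] + [random.randrange(1, Nt) for _ in range(d)] for j in range(1, n + 1)}
def q(j, x):                            # q_j^{(d)}(x) mod N-tilde
    return sum(c * pow(x, e, Nt) for e, c in enumerate(coef[j])) % Nt

def ek(i):                              # user i's key query: blinded shares multiply out
    acc = pow(Y[i], Nt, NN) * pow(1 + Nt, q(i, i), NN) % NN
    for j in range(1, n + 1):
        if j != i:                      # Q_ij = Y_j^Nt * (1+Nt)^{q_j(i)} mod Nt^2
            acc = acc * (pow(Y[j], Nt, NN) * pow(1 + Nt, q(j, i), NN) % NN) % NN
    return (acc - 1) // Nt              # D(.) from Lemma 1 recovers sum_j q_j(i) mod Nt

keys = {i: ek(i) for i in range(1, n + 1)}
assert all(keys[i] == sum(q(j, i) for j in range(1, n + 1)) % Nt for i in keys)

# Lemma 5: sum_i q(i) * L_{i,P}(0) = 0 mod N-tilde, for P = {1..n}.
def L0(i, P):                           # Lagrange coefficient at 0, modular inverses
    num = prod(-j % Nt for j in P if j != i) % Nt
    den = prod((i - j) % Nt for j in P if j != i) % Nt
    return num * pow(den, -1, Nt) % Nt

P = list(range(1, n + 1))
assert sum(keys[i] * L0(i, P) for i in P) % Nt == 0
print("Sub-Algorithm 1 keys recovered; Lemma 5 verified")
```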
Input Encoding

Recall that we employ two users with special roles. For simplicity of the presentation, we assume that the first two users (users 1 & 2) participate in the polynomial evaluation and act as the special users, but any other ordinary user can replace them.

After the aggregator declares the target polynomial f(x_P) = \sum_k c_k (\prod_i x_{ik}^{e_{ik}}) as well as the corresponding time window T_f, all users examine the time window. If T_f overlaps with any of the consumed time windows, all users abort. Otherwise, every user i ∈ P except users 1 & 2 computes and publishes his encoded value C(x_{ik}) for all k:

∀k : C(x_{ik}) = x_{ik}^{e_{ik}} H(t_k)^{q^{(|P|-1)}(i) L_{i,P}} ∈ Z_N^* ³

where t_k is the k-th time slot in T_f. At the same time, user 2 computes and publishes:

∀k : E_PK( C(x_{2k}) ) ∈ Z_{N²}^*

Subsequently, user 1 computes the C(x_{1k})'s for himself and computes all the V_k's for the product terms as shown below:

∀k : V_k = E_PK( C(x_{2k}) )^{\prod_{i∈P, i≠2} C(x_{ik})} = E_PK( \prod_{i∈P} x_{ik}^{e_{ik}} )

Subsequently, he randomly selects K_k ∈ Z_{N²}^* such that \prod_k K_k = 1 mod N², and he publishes K_k V_k^{c_k} for all k.

Even if a user i's x_{ik} does not appear in a product term \prod x_{ik}^{e_{ik}}, he still sets x_{ik} = 1 and participates in the analysis by following the algorithm, to guarantee the correctness of the final aggregation.

3. C(x_{ik}) may not be in Z_N^* when x_{ik} is not invertible modulo N, but this occurs only when x_{ik} is not co-prime to N, i.e., x_{ik} is equal to p or q, the two large safe prime numbers. The chance this happens is 1/(pq), and this is also the case where the large RSA number N is factorized. We simply ignore this chance because it is a negligible function of the security parameter, and it is also commonly believed that the factorization is intractable.

Aggregation

When all users finish their encoding and broadcasting, the aggregator conducts the following homomorphic operations:

\prod_k K_k V_k^{c_k} = E_PK( \sum_k c_k ( \prod_{i∈P} x_{ik}^{e_{ik}} ) ) mod N²

Then, he can use SK to decrypt the polynomial value. The analysis in Section 5 shows that PDA is correct and indistinguishable against chosen-plaintext attack. Note that the lowest degree in a user's encoding key EK_i is 2, and this forces at least three users to participate in the polynomial evaluation. This is because our scheme cannot guarantee the data privacy against other users when only two users participate in the evaluation (Section 7.5).
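To see why the H(t_k) maskers vanish inside V_k, the following sketch reproduces the cancellation in a toy group: we substitute a prime-field subgroup of order Ñ for Z_N^* (our simplification, mirroring Lemma 3's h^Ñ = 1), mask each value as in Encode, and observe that the product of the masked values equals the bare product term:

```python
# Toy check that the Encode maskers cancel inside a product term (Lemmas 3 & 5).
import random
from math import prod

Nt = 11 * 23                                   # N-tilde: order of the masking subgroup
Q = next(q for q in range(2 * Nt + 1, 10**6, Nt)
         if all(q % d for d in range(2, int(q**0.5) + 1)))   # prime with Nt | Q-1
h = next(pow(g, (Q - 1) // Nt, Q) for g in range(2, Q)
         if pow(g, (Q - 1) // Nt, Q) != 1)
assert pow(h, Nt, Q) == 1                      # analogue of Lemma 3: h^{N-tilde} = 1

P = [1, 2, 3, 4, 5]
# A global polynomial of degree |P|-1 with zero constant term (as output by KeyGen).
coef = [0] + [random.randrange(1, Nt) for _ in range(len(P) - 1)]
key = {i: sum(c * pow(i, e, Nt) for e, c in enumerate(coef)) % Nt for i in P}

def L0(i):                                     # Lagrange coefficient at 0 mod N-tilde
    num = prod(-j % Nt for j in P if j != i) % Nt
    den = prod((i - j) % Nt for j in P if j != i) % Nt
    return num * pow(den, -1, Nt) % Nt

Ht = pow(h, random.randrange(1, Nt), Q)        # stand-in for H(t_k) in <h>
x = {i: random.randrange(1, Q) for i in P}     # private values (all exponents e_ik = 1)
C = {i: x[i] * pow(Ht, key[i] * L0(i) % Nt, Q) % Q for i in P}

lhs = prod(C.values()) % Q                     # what the product term multiplies out to
rhs = prod(x.values()) % Q                     # the bare product of private values
assert lhs == rhs                              # maskers H(t_k)^{q(i) L_{i,P}} cancel
print("masked product equals true product:", lhs == rhs)
```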

4.3 Supporting Dynamic User Group

Up to this point, our design is able to support data analysis in a dynamic subgroup of a fixed group of n users. A more substantial ability is to support a dynamic group of users, i.e., to enable new users to join the system and old users to leave the system.

Adding a new user n+1

When a new user whose ID is n+1 needs to join the group of users {1, ..., n}, the number of participants in one round of data analysis can be as large as n+1. The first step is to update the key set of every user i in the old group {1, ..., n}: each user i needs to acquire and add one extra item q^{(n)}(i) to his key set EK_i for the worst case where all of the n+1 users participate in the analysis (so that their q^{(n)}(i)'s can be used when such a case occurs). The second step is to issue the new user n+1 his full key set EK_{n+1} = {q^{(2)}(n+1), ..., q^{(n-1)}(n+1), q^{(n)}(n+1)} so that he can participate in any data analysis consisting of 2, 3, ..., n+1 participants.

For the first step, all users in the new group perform the aforementioned secret sharing in the Key Generation, and every user j ∈ {1, ..., n+1} randomly defines q_j^{(n)}(x) whose degree is n and whose constant term is 0. Then, for each user i ∈ {1, ..., n+1}, every user j ≠ i computes and publishes Q_{ij} of degree n as in Sub-Algorithm 1, and i calculates q^{(n)}(i) in the same way. When this is repeated for all users, every user i in the old group {1, ..., n} has a complete key set which contains the key items q^{(2)}(i), ..., q^{(n)}(i), and the new user n+1 only has q^{(n)}(n+1).

For the second step, every user j ≠ n+1 performs the following process for d = 2, ..., n-1. User j re-uses his previously defined individual polynomial q_j^{(d)}(x) to calculate q_j^{(d)}(n+1) and publishes Q_{n+1,j} of degree d as in Sub-Algorithm 1. Subsequently, user n+1 calculates the product and acquires q^{(d)}(n+1) in the same way. When this is repeated for d = 2, ..., n-1, the new user acquires q^{(2)}(n+1), ..., q^{(n-1)}(n+1).

Finally, the newly shared items – q^{(n)}(i) for every i ∈ {1, ..., n+1} and q^{(d)}(n+1) for d = 2, ..., n-1 – complete every user's key set: each user i in the new group holds q^{(2)}(i), ..., q^{(n)}(i), the full key set for the new group of users {1, ..., n+1}.
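The key-set bookkeeping of a join can be summarized in a few lines (an illustrative sketch; the function name and the set representation are ours):

```python
# Bookkeeping sketch for a join: which polynomial degrees must each member hold?
def key_degrees(group_size: int) -> range:
    # q^(d)(i) is used when d+1 users participate; analyses involve 3..group_size users
    return range(2, group_size)                            # degrees 2 .. group_size-1

old = {i: set(key_degrees(5)) for i in range(1, 6)}        # group of n = 5 users
# user 6 joins: everyone (old and new) gains the degree-5 point q^(5)(i),
# and the newcomer additionally runs key queries for the lower degrees 2..4
new = {i: old.get(i, set()) | set(key_degrees(6)) for i in range(1, 7)}
assert all(new[i] == {2, 3, 4, 5} for i in new)
print({i: sorted(new[i]) for i in new})
```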

Adding n' new users

Batch arrival can be handled in a similar way. When n' new users arrive at the same time, one round of secret sharing occurs in the new group of n+n' users. As the first step, n' extra key items q^{(n)}(i), ..., q^{(n+n'-1)}(i) are delivered to every user i ∈ {1, ..., n+n'}. Then, as the second step, every new user i ∈ {n+1, ..., n+n'} acquires q^{(2)}(i), ..., q^{(n-1)}(i) as aforementioned.

Removing one or more existing users

When one or more users are removed or leave the group, the system can simply ignore the removed users, and the keys of the remaining users are not affected by the removal.

4.4 Representing Real Numbers

Due to the cryptographic operations, all the computations and operations in our scheme are closed in integer groups, and thus we need to use integers to represent real numbers. We exploit the homomorphism in PDA to represent real numbers via integers using the fixed point representation [51]. Given a real number x, its fixed point representation is given by [x] = ⌊x · 2^e⌋ for a fixed e, and all the arithmetic operations reduce to their integer versions as follows:

• Addition/Subtraction: [x ± y] = [x] ± [y]
• Multiplication/Division: [x · y^{±1}] = [x] · [y]^{±1} · 2^{∓e}

where x · 2^{-1} stands for x divided by 2. Then, we have the following conversions for the arithmetic operations used in our schemes when using fixed point representations:

• C([x]) C([y]) ⇒ C([x]) C([y]) · 2^{-e}
• E_PK([x])^{[y]} ⇒ E_PK([x])^{[y] · 2^{-e}}

where C(m) is the encoded communication string of [m] (Encode algorithm), and E_PK(m) denotes the ciphertext of m encrypted with PK (Key Generation algorithm). We do not use the floating point representation because it is hard to implement the arithmetic operations on floating point numbers. Although this conversion gives practically high precision, one may consider employing [3], [32] for much higher precision.
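A minimal sketch of this fixed-point convention (plain Python integers; e = 16 is an arbitrary choice of ours) shows the 2^{-e} rescaling that multiplication picks up; negative values are shown directly here, whereas in the scheme they would be embedded into Z_N:

```python
# Fixed-point representation [x] = floor(x * 2^e): toy arithmetic sketch.
E = 16                                                 # fractional bits (our choice)
def enc(x: float) -> int:  return int(x * (1 << E))    # [x]
def dec(v: int) -> float:  return v / (1 << E)

a, b = enc(3.25), enc(-1.5)
assert dec(a + b) == 3.25 - 1.5                        # addition/subtraction is direct
product = (a * b) >> E                                 # multiply, then rescale by 2^{-e}
print(dec(product))                                    # -4.875, up to 2^{-E} rounding
```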
4.5 Data Analysis via PDA: Examples

We only present some examples in this section, since infinitely many different analyses are supported and it is not possible to enumerate all of them here; any data analysis based on polynomials can be conducted via our framework.

Statistics Calculation: Various statistical values including, but not limited to, the mean, variance, sample skewness, and weighted mean deviation are all polynomials of the data values, and all of them can be privately evaluated by a third-party aggregator without knowing individual data by leveraging our PDA.

Regression Analysis: Regression analysis is a statistical process for estimating the relationship among variables. The widely used polynomial regression and ridge regression both belong to it. In both, each user i's data record is described as a feature vector x_i and a dependent variable y_i, and training a regression model is to find the p which minimizes MSE(p) = \sum_i (y_i - p x_i)², i.e., the linear predictor which predicts the users' dependent variable vector y using their feature matrix X with minimum mean squared error. Since MSE(p) is convex, it is minimized if and only if Ap = b, where A = XᵀX and b = Xᵀy. Further, A = \sum_i x_i x_iᵀ and b = \sum_i y_i x_i. Therefore, the aggregator can obliviously evaluate them with several calls to our PDA. Note that any polynomial regression model or ridge regression model can be trained in the same way.

Support Vector Machine: The Support Vector Machine (SVM) is a supervised learning model which is widely used to classify data records. Similar to the regression analysis, a user i's data record is described as a feature vector x_i and its class y_i ∈ {-1, 1}, and training an SVM is to find a vector w and a constant b such that the hyperplane wx - b = 0 separates the records having y_i = -1 from those having y_i = 1 with the maximum margin 2/|w|. The dual form of this optimization problem is represented as maximizing L(a) in a = {a_1, ..., a_n}:

L(a) = \sum_{i=1}^{n} a_i - (1/2) \sum_{i,j} a_i a_j y_i y_j x_iᵀ x_j

subject to \sum_{i=1}^{n} a_i y_i = 0 and 0 ≤ a_i ≤ C. This is solved by the Sequential Minimal Optimization (SMO) algorithm [53] with a user-defined tolerance, which greedily seeks the local maximum iteratively until convergence within a certain error. Due to the equality constraint, L(a) becomes a one-dimensional concave quadratic function in each iteration, which can be trivially optimized. The coefficients of the quadratic function are polynomials of all users' feature vectors and class values, and the aggregator can obliviously calculate them using our framework and optimize L(a) at each iteration locally.
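The regression reduction is easy to see in code: every entry of A and b is a sum over users of a degree-≤2 monomial of private values, i.e., exactly one polynomial of the form of Eq. (1) per entry. The sketch below (toy data of ours, computed in the clear) enumerates the aggregates that the aggregator would request through PDA:

```python
# Which polynomials does linear regression need? A = sum_i x_i x_i^T, b = sum_i y_i x_i.
users = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.2), ([2.0, 0.0], 1.9)]  # (x_i, y_i) toy data
d = 2

A = [[sum(x[r] * x[c] for x, _ in users) for c in range(d)] for r in range(d)]
b = [sum(y * x[r] for x, y in users) for r in range(d)]
# Each A[r][c] and b[r] matches Eq. (1):
#   A[r][c]: coefficient 1, product term x_{i,r} * x_{i,c}, summed over users i
#   b[r]:    coefficient 1, product term y_i * x_{i,r},     summed over users i
print("A =", A, " b =", b)   # the aggregator then solves A p = b locally
```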

5 CORRECTNESS & SECURITY PROOFS

5.1 Correctness Proof

Theorem 1. Our PDA is correct.

Proof. We show step by step that each algorithm correctly calculates its output, which guarantees the correctness of our scheme.

Key Generation: The correctness of the homomorphic encryption is already shown in [52], therefore we focus on the users' key generation. Due to Lemma 4, we have:

Y_i^{Ñ} (1+Ñ)^{q_i^{(d)}(i)} \prod_{j≠i} Q_{ij} = ( \prod_{j=1}^{n} Y_j^{Ñ} ) (1+Ñ)^{\sum_{j=1}^{n} q_j^{(d)}(i)} = (1+Ñ)^{\sum_{j=1}^{n} q_j^{(d)}(i)} mod Ñ²

Further, the existence of the discrete logarithm solver (Lemma 1) guarantees that the discrete logarithm will be correctly calculated.

Input Encoding: User 1 conducts the following computation in the input encoding:

∀k : V_k = E_PK( C(x_{2k}) )^{\prod_{i∈P, i≠2} C(x_{ik})}
         = E_PK( \prod_{i∈P} C(x_{ik}) )
         = E_PK( H(t_k)^0 \prod_{i∈P} x_{ik}^{e_{ik}} ) = E_PK( \prod_{i∈P} x_{ik}^{e_{ik}} )

The first two equations hold because of the additive homomorphism of the encryption E_PK(·), and the third equation holds because:

\sum_{i∈P} q^{(|P|-1)}(i) L_{i,P}(0) = 0 mod Ñ  (Lemma 5)
⇒ H(t_k)^{\sum_{i∈P} q^{(|P|-1)}(i) L_{i,P}(0)} = 1 mod N  (Lemma 3)

The correctness of the Aggregation algorithm is obvious. In conclusion, after Setup, KeyGen, and Encode are all executed, the output of Aggregate is f(x_P) with probability 1.

5.2 Security Proof

To prove the indistinguishability (IND-CPA) of our PDA, we first prove the following lemmas.

Lemma 6. In the KeyGen algorithm, Y_i is statistically indistinguishable from a uniformly random element chosen from Z_Ñ^*.

Proof. In the first secret-sharing phase, every user publishes y_i = g̃^{r_i} where r_i is randomly chosen, but since the DDH problem is believed to be intractable in Z_Ñ when the factorization of Ñ is unknown, g̃^{r_i} is indistinguishable from a uniformly random element chosen from Z_Ñ, as shown in [10].

Lemma 7. In the KeyGen algorithm, any secret data point q_j^{(d)}(i) remains statistically indistinguishable from a uniformly randomly chosen element, and all the encoding keys are indistinguishable from uniformly randomly chosen elements.

Proof. In the EK_i Query (Sub-Algorithm 1) phase, every user j ≠ i publishes Q_{ij} = Y_j^{Ñ} (1+Ñ)^{q_j^{(d)}(i)}. We show that Q_{ij} leaks no statistical information about the hidden random polynomial value q_j^{(d)}(i) as follows. We first prove that, given y = Y_j^{Ñ} (1+Ñ)^m and m only (without Y_j), deciding whether m is the corresponding hidden value in y (denoted as the D-Hidden problem) is equivalent to the DCR problem. We suppose there is an oracle O_{D-Hidden} who can solve the D-Hidden problem, and another oracle O_{DCR} who can solve the DCR problem.

D-Hidden ≤_P DCR: Since (1+Ñ) is an Ñ-th non-residue, y is an Ñ-th residue if and only if m = 0 mod Ñ. Then, given y and m, one can submit y(1+Ñ)^{-m} to O_{DCR} to see if it is an Ñ-th residue, to decide whether m is the hidden value.

DCR ≤_P D-Hidden: Similarly, given x in the DCR problem (deciding if x is an Ñ-th residue), one can submit m = 0 and x to O_{D-Hidden} to see if m = 0 is the corresponding hidden value in x, i.e., whether x is an Ñ-th residue or not.

Therefore, deciding whether m is the hidden value in y is exactly as hard as the DCR problem. By the same reasoning, given two known messages m_0, m_1 as well as y = Y_j^{Ñ} (1+Ñ)^{m_b}, deciding which message is the corresponding hidden value is exactly as hard as the DCR problem, which concludes the semantic security of Q_{ij} against chosen-plaintext attack.

Due to the semantic security of Q_{ij}, each user's individual polynomial value is statistically indistinguishable from a uniformly random element, and therefore any encoding key q^{(d)}(i) of any user i is indistinguishable from an element uniformly randomly chosen from Z_Ñ^*, since it is the sum of all the individual polynomial values.

With the above lemmas, we are ready to present the proof of the following theorem.

Theorem 2. Under the DDH assumption and the DCR assumption, our PDA scheme is indistinguishable against chosen-plaintext attack (IND-CPA) in the random oracle model. Namely, for any PPTA A, its advantage adv_A^{PDA} in the Data Analysis Game is bounded as follows:

adv_A^{PDA} ≤ ( e (q_c + 1)² / q_c ) · adv_A^{DDH}

where e is the base of the natural logarithm and q_c is the number of queries to the encode oracle.

Proof. To prove the theorem, we present three related games Game 1, Game 2, and Game 3, in which we use A and B to denote the adversary and the challenger. For each l ∈ {1, 2, 3}, we denote by E_l the event that B outputs 1 in Game l, and we define adv_l = |Pr[E_l] - 1/2|.

Game 1: This is the game exactly identical to the Data Publishing Game in Section 3.3. A's encode queries (f(x), T_f, {x_{ik}}_{i,k}) are answered by returning the encoded values {C(x_{ik})}_{i,k}. In the challenge phase, the adversary A selects a target polynomial f*(x), the associated time window T_{f*}, and two sets of values x_{P,1}, x_{P,2} which satisfy f*(x_{P,1}) = f*(x_{P,2}). Then, the challenger B returns the corresponding encoded values to the adversary A. When the game terminates, B outputs 1 if b' = b and 0 otherwise. By the definition, adv_1 = |Pr[E_1] - 1/2| = adv_A^{PDA}.

Game 2: This game occurs after Game 1 terminates. In Game 2, the adversary A and the challenger B repeat the same operations as in Game 1 using the same time windows for those operations. However, for each encode query of Game 1 at time t ∈ T_1 ∪ T_2, the challenger B flips a biased binary coin µ_{T_f} for the entire time window T_f, which takes 1 with probability 1/(q_c+1) and 0 with probability q_c/(q_c+1). When Game 2 terminates, B checks whether any µ_{T_f} = 1. If there is any, B outputs a random bit. Otherwise, B outputs 1 if b' = b and 0 if b' ≠ b. If we denote by F the event that µ_{T_f} = 1 for some T_f, the same analysis as in [21] shows that Pr[F] = 1/(e(q_c+1)). According to [40], Game 1 to Game 2 is a transition based on a failure event of large probability, and therefore we have adv_2 = adv_1 · Pr[F] = adv_1 / (e(q_c+1)).

Game 3: In this game, the adversary A and the challenger B repeat the same operations as in Game 1 using the same time windows for those operations. However, there is a change in the answers to the encode queries (f(x), T_f, {x_{ik}}_{i,k}). The oracle responds to the query with the following encoded values:

∀i, ∀k : C(x_{ik}) = x_{ik}^{e_{ik}} H(t_k)^{q^{(|P|-1)}(i) L_{i,P}(0)}          if µ_{T_f} = 1
∀i, ∀k : C(x_{ik}) = x_{ik}^{e_{ik}} ( H(t_k)^{q^{(|P|-1)}(i) L_{i,P}(0)} )^s    if µ_{T_f} = 0

where s is a uniformly random element chosen from Z_Ñ and fixed for each polynomial f(x), i.e., the same polynomial has the same s. When Game 3 terminates, B outputs 1 if b' = b and 0 otherwise.

Due to Lemma 8 below, distinguishing Game 3 from Game 2 is at least as hard as a DDH problem for any adversary A in Game 2 or Game 3. It follows that |Pr[E_2] - Pr[E_3]| ≤ adv_A^{DDH}. The answers to encode queries in Game 3 are identical to those of Game 2 with probability 1/(q_c+1) and different from those of Game 2 with probability q_c/(q_c+1). In the latter case, due to the random element s, C(x_{ik}) is uniformly distributed in the subgroup of order Ñ, which completely blinds x_{ik}^{e_{ik}}, and A can only randomly guess b' unless he knows the factorization N = pq. Then, b' = b with probability 1/2, and the total probability Pr[E_3] = Pr[E_2]/(q_c+1) + q_c/(2(q_c+1)). Then, we have:

Pr[E_2] - Pr[E_3] = (Pr[E_2] - 1/2) · q_c/(q_c+1) = adv_2 · q_c/(q_c+1) ≤ adv_A^{DDH}

Combining the above inequality with the advantages deduced from Game 1 and Game 2, we finally have:

adv_2 · q_c/(q_c+1) = adv_A^{PDA} · q_c/(e(q_c+1)²) ≤ adv_A^{DDH}

Lemma 8. Distinguishing Game 3 from Game 2 is at least as hard as solving a DDH problem for A.

Proof. The only difference between Game 2 and Game 3 is the oracle's answers to the encode queries. When µ_{T_f} = 1, the answers in Game 3 are exactly the same as the ones in Game 2. However, when the failure event F occurs (i.e., µ_{T_f} = 0), B answers using H(t)^s instead of H(t). Since the values x_{ik} are submitted as a query by A, A knows the values x_{ik}. Then, distinguishing Game 2 and Game 3 is equivalent to distinguishing the following two terms:

( H(t)^{q^{(|P|-1)}(i) L_{i,P}(0)} )^s   v.s.   H(t)^{q^{(|P|-1)}(i) L_{i,P}(0)}

Since H(t) maps t ∈ Z to the cyclic multiplicative group ⟨h⟩, we let H(t) = h^x for some integer x. Let a = x and b = q^{(|P|-1)}(i) L_{i,P}(0). Given a triple (h^a, h^b, h^c), deciding whether h^c is h^{ab} or a random element of Z_N^* is exactly the k-DDH problem. Since s is a uniformly random number, distinguishing between the above two terms is equivalent to solving this k-DDH problem. However, in Game 3, the adversary is only given h^a but not h^b, which means distinguishing Game 3 from Game 2 is at least as hard as the k-DDH problem.

Due to Lemma 2, the k-DDH problem is at least as hard as the DDH problem when k is known. However, k = φ(N)/Ñ is unknown to A, and therefore the k-DDH problem in our group is also at least as hard as the DDH problem for A.

In conclusion, distinguishing Game 3 from Game 2 is at least as hard as a DDH problem for A.

6 PERFORMANCE EVALUATION

6.1 Theoretic Performance: Storage & Communication

Key Storage Complexity: We only require every user to keep a set of encoding keys EK_i = {q^{(2)}(i), ..., q^{(n-1)}(i)}. The Lagrange coefficients L_{i,P} needed in Encode can be locally computed when the P of the analysis is declared. Therefore, the storage complexity is O(n), where n is the total number of users. To the best of our knowledge, there is no sub-linear solution under the same threat model and performance requirements (Section 8).

Communication Overhead: The communication overhead for evaluating a certain polynomial f(x_P) in terms of transmitted bits is summarized in Table 1. κ is the bit-length of the safe prime numbers used in our scheme, n' is the number of new users added into the system, m is the total number of product terms in the polynomial (Eq. (1)), and θ_min is the minimum threshold on the number of participants in the evaluation.

TABLE 1
Communication Overhead

Algorithm                      | Send (bits) | Receive (bits)
Authority
  Aggregate                    | 0           | O(mκ)
Special User 1
  KeyGen when k users collude  | O(kn²κ)     | O(kn²κ)
  Encode                       | O(mκ)       | O(θ_min mκ)
  Adding n' users              | O(n'n)      | O(n'n)
Other participants including Special User 2
  KeyGen when k users collude  | O(kn²κ)     | O(kn²κ)
  Encode                       | O(mκ)       | 0
  Adding n' users              | O(n'n)      | O(n'n)

User 1, with his special role, has to receive all other users' communication strings to generate the homomorphic ciphertexts, so his communication overhead is θ_min times greater. Besides, we only need two communication rounds for each polynomial evaluation, which is a constant irrelevant to the number of participants. Specifically, everyone except user 1 sends out his data in the first round, and user 1 broadcasts the homomorphic ciphertexts in the second round. Finally, when κ = 512, the actual size of an encoded value C(x_{ik}) is 256 B, and a homomorphic ciphertext E(C(x_{2k})) is 512 B.
6.2 Practical Performance: Computation

To evaluate the computation overhead, we implemented our scheme on Ubuntu 12.04 using the GMP library (gmplib.org), in C, on a computer with an Intel i3-2365M CPU @ 2.80 GHz ×2 and 3 GB of memory. To exclude the communication overhead from the measurement, we generated all the communication strings as files on one computer and conducted all the computation locally (I/O overhead excluded from the measurement). The security parameter κ is set to 512 bits to achieve strong prime numbers (the moduli N, Ñ are around 1024 bits). Paillier's cryptosystem is employed as the homomorphic encryption in Encode, and it is also implemented in C.

Microbenchmark

We first perform a series of microbenchmarks to measure the basic cost units of our scheme, as in Table 2. The overhead of the Encode and Aggregate algorithms depends on the polynomial to be evaluated, so we randomly chose a polynomial with a single product term having 10,000 participants to measure the exact run time of each algorithm (Table 2).

TABLE 2
Basic Cost Units

Algorithm            | Min     | Max     | Mean    | Std. Dev.
Authority
  Aggregate          | 0.25ms  | 0.31ms  | 0.28ms  | 0.01ms
Special user 1
  Encode             | 9.7ms   | 10.1ms  | 9.846ms | 0.057ms
Special user 2
  Encode             | 9.4ms   | 9.6ms   | 9.458ms | 0.053ms
Other participants
  Encode             | 0.115ms | 0.145ms | 0.129ms | 0.0186ms

User 1's and 2's overheads are more expensive than the other users' because they conduct operations in both Z_N^* (1024 bits) and Z_{N²}^* (2048 bits), while the other users' operations are closed in Z_N^* only. Besides, the aggregator's overhead is cheaper than user 1's or 2's because he only performs multiplications, which are much cheaper than exponentiations.

As discussed in Section 7.4, user 1's overhead is affected by the number of actual participants in each product. Also, it is clear that the overall overhead is related to the number of products in the polynomial. To evaluate the scalability w.r.t. those parameters, we execute the Encode algorithm and the Aggregate algorithm for randomly chosen polynomials with different parameters (Figure 1).

In Figure 1(a), the x-axis is the minimum threshold, the y-axis is the number of product terms in the polynomial, and the z-axis is the run time. In Figures 1(b) and 1(c), the x-axis is the number of product terms in the polynomial, and the y-axis is the run time. Clearly, the run time of Encode grows bilinearly w.r.t. the parameters for the special-role user 1; therefore, when either parameter is large, one should employ a cloud server to act as the special user 1, since an ordinary user may not be able to handle such huge computation. The run time of Encode grows linearly w.r.t. the number of product terms for the other users (Figure 1(b)), and the same applies to Aggregate for the aggregator (Figure 1(c)). Note that the performance is more sensitive to the number of products because it is equal to the number of exponentiations in Z_{N²}^*, which are much more expensive than the multiplications.

Data Analysis via PDA

In this section, we obtain various datasets available online (UCI machine learning datasets [4], SNAP Amazon Movie Review [49], SNAP Amazon Product Review [48]), and conduct various analyses on them as described in Section 4.5 to show the generality of our framework as well as its performance. Note that the run times shown below are the overall run times when all the computations are conducted on one laptop computer; they are not the run times per user. The actual run time for a non-special user is almost negligible because the special users' overheads contribute the majority of the run time, as analyzed above.

Statistics Calculation: As aforementioned (Section 4.5), various statistics analyses are represented as polynomials, which can be directly evaluated by our scheme. To compare the performance with peer works, we implemented three schemes, Shi [54], Joye [40], and Jung [41], in the same environment, whose purpose is to calculate a product or a sum on a private dataset. Due to this limitation, these schemes can be used to calculate simple statistic values only (e.g., mean, standard deviation), and we compared the overall run time of the mean calculation on random values, as most of the statistics analyses supported by [40], [41], [54] are based on mean calculation. The entire calculation is conducted on a laptop computer to measure the pure computation overhead only, and the security parameter of each scheme is chosen such that all schemes have similar (≈80-bit) security levels.

[Fig. 2. Comparisons of mean calculation: run time (ms) vs. records # (or size of user group) for Shi, Joye, Jung, and ours.]

As shown in Figure 2, the overhead of our scheme is comparable to that of Joye [40], and is twice those of Shi [54] and Jung [41]. The main factor of this gap is the bit-length of the moduli: when the bit-security levels are comparable to each other, the bit-lengths of the largest moduli in Shi [54] and Jung [41] are twice as large as the ones in Joye [40]. Although our performance is not the best among these four schemes, the other schemes can only let an aggregator obliviously evaluate a product or a sum over a fixed dataset. Considering our scheme's generality, which enables an aggregator to obliviously evaluate general multivariate polynomials over arbitrary subsets of datasets (with the support of new user arrival and old user removal), we claim that this performance gap is acceptable.

[Fig. 1. Run time of different algorithms for different roles: (a) user 1's Encode algorithm (minimum threshold × product terms # × run time); (b) other users' Encode (product terms # × run time, ms); (c) the aggregator's Aggregate (product terms # × run time, ms).]

Regression Analysis: We trained a linear regression model over the datasets in a privacy-preserving manner using our scheme (Section 4.5), and measured the communication overhead for each ordinary user as well as the overall run time of the entire training on a local computer (Table 3). Specifically, in the SNAP datasets, we counted the numbers of 'good' strings (e.g., 'I love', 'I like', 'best') and 'bad' strings (e.g., 'I hate', 'terrible') in the review, and treated those numbers as well as the price as the features. We further treated the score rating as the dependent variable.

TABLE 3
Linear Regression on Datasets

Datasets     | Records # | Feature # | Comm. | Time
UCI Dataset
  Adult      | 48,842    | 14        | 237KB | 355s
  Bank       | 45,211    | 17        | 344KB | 341s
  Insurance  | 9,822     | 14        | 236KB | 74s
  White wine | 4,898     | 11        | 148KB | 33s
  Red wine   | 1,599     | 11        | 148KB | 12s
SNAP Dataset
  Arts       | 28K       | 3         | 13KB  | 210s
  Books      | 13M       | 3         | 13KB  | 26h
  Games      | 463K      | 3         | 13KB  | 0.9h
  Movie      | 8M        | 3         | 13KB  | 11h

Moreover, we compare our performance with a similar work [51] which designed a privacy-preserving ridge regression over millions of private data records (Table 4). As we failed to obtain the source code of their scheme, we perform the comparison based on the statistics provided in their paper. This is not a completely fair comparison because of the different computing environments. However, PDA supports any polynomial-based data analysis while [51] only supports the ridge regression; therefore, we claim that our solution is preferable if the performance of our general solution is comparable to the ad hoc solution.

TABLE 4
Comparison with [51]

Ridge Regression | Insurance | Red wine | White wine
[51]             | 55s       | 39s      | 45s
Ours             | 74s       | 12s      | 33s

SVM Analysis: We implemented the SMO algorithm [53] in a privacy-preserving manner (Section 4.5) to train a linear support vector machine over the same datasets (part of the UCI datasets) as those in [53]. Please refer to that paper for the detailed descriptions of the datasets.

Note that the run time to train an SVM is linear in the number of iterations until convergence in the SMO algorithm, which heavily depends on the characteristics of the dataset. Our experiments show that the SVM training via PDA has a non-negligible overhead (Table 5), but this is inevitable because 1) the run time of the plain SMO algorithm on the same dataset is also of several seconds, and 2) the inherent complexities of the multiplications and exponentiations over κ-bit integers are O(κ²) and O(κ³) respectively, and the plain SVM without privacy considerations performs arithmetic operations on double-type numbers (8 B) while our framework (with an 80-bit security level) performs arithmetic operations as well as exponentiations on 256 B numbers.

TABLE 5
SMO on Datasets

Datasets   | Records # | Features # | Time
Adult [43] | 1,605     | 14         | 2.75h
Web [12]   | 2,477     | 17         | 49.75h

To the best of our knowledge, [44] is the first and only work that presents a privacy-preserving method to train an SVM over a large number of individual records (most similar works study how to train an SVM on a small number of vertically or horizontally partitioned databases), but their work does not evaluate the performance with a real implementation. Although we were not able to acquire the implementation, we could indirectly compare our performance with this only similar work. Their theoretic performance complexities (computation & communication) are exponentially greater than ours. Such a gap exists because they relied on secure multi-party computation. The performance gap between the solutions based on secure multi-party computation and our work, discussed in Section 8.1, also indicates that the performance gap in real implementations is huge. Therefore, ours can be considered as a development of their solution which improves the performance aspect by getting rid of the secure multi-party computation.

7 DISCUSSIONS & ANALYSIS

7.1 Disjoint Time Domains

To be indistinguishable against CPA, our Encode must be randomized. However, it is not easy to let users achieve independent but coordinated random masks such that their aggregation yields a constant, if we do not rely on a trusted center. We use the time slot as the input of our hash function H(·) to address this challenge.

In the Encode algorithm, H(t_k)^{q^{(|P|-1)}(i) L_{i,P}} ∈ Z_N^* is used to mask the private value x_{ik}. As analyzed in Section 5, such random masks are seemingly random under the random oracle model. Since the random mask is based on the hash function H(·), which is deterministic, we require a disjoint time domain and forbid the re-use of any time slot t_k. The same time slots yield the same randomizers, and this would allow some polynomial adversaries to have non-negligible advantages in our game.
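As an illustration of why slot re-use is forbidden, the sketch below instantiates H(·) by hashing the slot into the exponent of an order-Ñ element in the toy group used earlier (this concrete hash-to-subgroup construction is our assumption purely for illustration; the paper only models H as a random oracle into ⟨h⟩):

```python
# Illustrative instantiation of H : Z -> <h> by hashing the time slot into an exponent.
import hashlib

Nt = 11 * 23
Q = 1013                                  # prime with Nt | Q-1 (toy stand-in for Z*_N)
h = next(x for g in range(2, Q) for x in [pow(g, (Q - 1) // Nt, Q)]
         if pow(x, 11, Q) != 1 and pow(x, 23, Q) != 1)   # order exactly Nt (cf. Lemma 3)

def H(t: int) -> int:
    e = int.from_bytes(hashlib.sha256(str(t).encode()).digest(), "big") % Nt
    return pow(h, e, Q)                   # deterministic: same slot t => same masker

print("slot reuse gives the identical masker:", H(7) == H(7))
print("distinct slots give (w.h.p.) distinct maskers:", H(7) != H(8))
```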
7.2 Minimum Threshold for |P|

Since the security of PDA relies on random masks whose product or sum equals an identity element, we require that the minimum number of users in any polynomial evaluation be θ_min = 3. Otherwise, a user could trivially deduce another user's secret random mask r by inverting his own random mask r^{-1} (in the multiplicative group). That is, besides the special users 1 & 2, we also need one ordinary user participating in the evaluation. This is why we calculate the secret keys q^{(d)}(i) in the KeyGen algorithm only for d = 2, 3, ..., n-1 (recall that q^{(k)}(i) is used when k+1 users participate in the evaluation).

7.3 Passive Rushing & Collusion Attacks

In the previous sections, we intentionally simplified the adversarial model as well as the construction for readability. Consequently, the basic construction allows certain attacks. In this section, we present a countermeasure to achieve a more complete construction.

In a passive rushing attack, attackers access the messages sent by others before they choose their own random numbers [30]. Specifically, during the exchanges of the y_j in the KeyGen algorithm, an adversarial user i-1 can send out y_{i-1} = y_{i+1} · g̃^{-a} to user i (a ∈ Z_Ñ^*) after receiving y_{i+1}. Then, user i's random mask is equal to Y_i = (y_{i+1} y_{i-1}^{-1})^{r_i} = g̃^{(r_{i+1}-(r_{i+1}-a))r_i} = g̃^{a r_i}, which can be efficiently calculated from i's public parameter y_i = g̃^{r_i}.
7.4 Performance Optimization

In the Encode algorithm, we require that everyone submit his encoded value C(x_ik) even if his x_ik does not appear in the product term ∏_i x_ik^{e_ik} (in which case x_ik is set to 1). This is because the number of actually participating users must be at least a certain threshold θ_min to guarantee semantic security (reviewed in detail in Section 7.2). However, in some polynomials, single product terms involve very few users regardless of the total number of participants |P| of the polynomial, e.g., f(x_P) = ∑_{i,j∈P} x_i·x_j. If strictly following Encode, one needs |P|-1 (encoded) 1's multiplied into each product term x_i·x_j. However, to preserve the semantic security of the scheme we only need θ_min users to participate in each single product term, so this is unnecessary overhead. This performance degradation becomes huge when θ_min ≪ |P|, but we can simply remove it as follows: whenever the number of participants involved in a product term is less than θ_min, randomly pick dummy users from P until the term has θ_min users, instead of having everyone in P participate in the analysis. The dummy users then participate in the evaluation by providing 1 as their private values. By doing so, every product term has only the minimum number of participants, which greatly cuts down the unnecessary overhead.
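A minimal sketch of this padding rule (the function name and data types are ours):

```python
# Pick just enough dummy users (who will encode the constant 1) so that a
# product term is encoded by theta_min users rather than by everyone in P.
import random

THETA_MIN = 3  # minimum participants per product term (Section 7.2)

def pad_to_threshold(term_users: set, P: set) -> set:
    """Users who must submit an encoding for one product term: the real
    participants plus dummies sampled from P until the threshold is met."""
    if len(term_users) >= THETA_MIN:
        return set(term_users)
    dummies = random.sample(sorted(P - term_users),
                            THETA_MIN - len(term_users))
    return set(term_users) | set(dummies)

# For f(x_P) = sum_{i,j} x_i*x_j with |P| = 1000, a term x_1*x_2 is now
# encoded by 3 users instead of all 1000:
P = set(range(1000))
assert len(pad_to_threshold({1, 2}, P)) == THETA_MIN
```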
7.5 Limitations & Future Works

Performance Limit: It is shown in Section 6 that user 1's communication and computation overhead grows linearly w.r.t. the actual participants in each product term, which may become huge in some big-data analyses. In such a case, we may add a cloud server to take this role and let it provide the 1's during the evaluation. Besides, each person needs to send/receive O(n²·κ) bits (Section 6.1) with a small constant during the one-time operation KeyGen, which may be large when n is large. This can be reduced with several naïve ideas, for example dividing the n users into several subgroups and forcing the analysis within the subgroups only. But such ideas disable some types of analysis, and handling this limit is one of our future works.

Polynomial Parameters: This paper focuses more on users' data privacy and pays less attention to the aggregator's privacy. This is why we require that the polynomials be public; otherwise, aggregators may adaptively design the polynomial (e.g., f(x_P) = x_1) to directly infer a user's data. However, in some cases the polynomial parameters may be proprietary property which also needs to be protected (e.g., data mining techniques), but then corresponding verification mechanisms should be introduced as well to enable users to verify whether the polynomial is maliciously designed, which is another of our future works.

Finite Real Numbers: Although our design supports polynomials over real numbers, the real numbers are represented via a fixed-point representation, which constructs a one-to-one mapping between the real numbers and the integers in the cyclic group we use as part of our crypto protocols. Therefore, the real numbers that can be expressed in our system form a finite set whose size is equal to the order of the group in our algorithms (i.e., N).
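For concreteness, a minimal sketch of such a fixed-point mapping; the modulus and the number of fractional bits below are our assumptions, not the parameters of PDA:

```python
# Reals are scaled by 2^F and mapped into Z_N; the upper half of the group
# represents negatives (two's-complement style), giving a one-to-one map
# on the representable set.
N = 2**2048 - 159   # stand-in for the group order N used by the protocol
F = 20              # fractional bits, i.e., precision 2^-20

def to_group(x: float) -> int:
    """Real number -> element of Z_N."""
    return round(x * 2**F) % N

def from_group(a: int) -> float:
    """Element of Z_N -> real number, decoding the upper half as negative."""
    if a > N // 2:
        a -= N
    return a / 2**F

assert from_group(to_group(-3.25)) == -3.25
# Additions in the ring track real additions (multiplication would need a
# rescaling by 2^F, omitted here):
a, b = to_group(1.5), to_group(2.0)
assert from_group((a + b) % N) == 3.5
```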
Data Publishing: Our PDA only supports pull-based data analysis, in which the aggregator requests the data analysis with the declaration of the polynomial, and then users publish their data according to it. One thought is to achieve a push-based analysis, where users publish their data whenever it is generated and the aggregator chooses the desired subset of data and periods to perform the data analysis. However, once the data is published, it is no longer under the user's control, and it becomes much more challenging to protect the privacy. Designing a push-based data analysis which can prevent malicious analysis on already published data is a very challenging task, and we leave this as our future work too.

8 RELATED WORK

The purpose of our scheme is to protect the data privacy rather than the identity privacy. Therefore, we focus on reviewing peer works studying privacy-preserving data analysis whose purpose is to protect the data confidentiality. In general, related approaches can be categorized into the following families: approaches based on secure multi-party computation, perturbation, segmentation, functional encryption, and secret-shared keys.

8.1 Secure Multi-Party Computation

TABLE 7
Comparison of the fastest SMC implementation and ours

5,500 AND gates via GMW [17]   134 additions via ours
15-20 seconds                  0.06-0.08 seconds

Secure Multi-party Computation (SMC) [57] enables n parties to jointly and privately compute a function f_i(x), where the result f_i is returned to participant i only. One could trivially use SMC to enable data analysis on published sanitized data, but SMC is subject to extraordinarily high delay in real-life implementations. Among various real implementations of multi-party computation schemes (FairplayMP [6], VIFF [23], SEPIA [13], and GMW [17]), even the fastest, GMW [17], does not scale well: its number of communication rounds is linear in the number of users with huge constant factors, and it takes 15-20 seconds to evaluate 5,500 AND gates. A similar-size computation in our scheme incurs only subsecond-level computation time and a single round of communication of several milliseconds (Table 7). Such a huge performance gap is due to the different purposes: SMC targets arbitrary computation while ours targets polynomial-based data analysis only. Considering the number of users and the volume of data, it is more practical to rely on schemes specially designed for the targeted general analysis to enable lightweight protocols.

8.2 Perturbation

Perturbation is often leveraged to achieve differential or membership privacy ([27], [47]). In such approaches, a random noise drawn from a known distribution is added to mask the original data, and some statistics about the aggregate result can be deduced based on the distribution of the random noises. These approaches enable certain analyses on published (sanitized) data (normally without requiring a trusted key dealer), but the accuracy depends on the guaranteed security (i.e., accuracy increases when security is sacrificed), and the temporal correlation between the data leads to limited utility when the data is time-series. Ács et al. [1] proposed a privacy-preserving smart metering system which supports various aggregate statistics without leaking household activities. Fan et al. [28] presented an adaptive approach to releasing time-series data with differential privacy, but their relative error is greater than 10^{-1} with certain privacy sacrifice. Recently, Chen et al. [16] presented a scheme based on perturbation to mine frequent patterns in the form of substrings, but the utility for prefixes is poor due to the noise that makes the scheme differentially private. Supported by the recent demand for accuracy [38], we claim that, for some applications in the future, it is not desirable to rely on approaches whose accuracy depends on the privacy requirements of the applications.

8.3 Functional Encryption

Functional encryption [11] is a recently proposed type of public-key encryption in which a certain function of the plaintext can be derived from the ciphertext if one possesses a valid private key corresponding to that function. It is possible to leverage FE to design a solution to our problem, but in FE each key is associated with a pre-defined function and can be used to derive only that one function of the plaintext. In many data mining algorithms, the actual target analysis functions (classifiers or predictive models) are trained iteratively, in which the training at each iteration is another type of function to be evaluated over the private dataset. Since all such analyses are data-dependent, issued keys can hardly be re-used, and keys would have to be issued to all participants in advance of every data analysis, which is a communicationally expensive interaction. Furthermore, the key distribution must be performed via secured channels. In contrast, our keys can be re-used arbitrarily many times as long as the time windows for every polynomial are disjoint with each other.

8.4 Secret-shared Keys

Several schemes distribute certain keys to the data users such that the sum or product of the keys is a constant; these keys are then used to mask the original data. These schemes are very close to ours.

Shi et al. [54] and Joye and Libert [40] let the aggregator compute the sum of participants' data x_i without learning anything else. They both let a trusted third party generate random keys s_i for the i-th participant such that ∑ s_i = 0. It then follows that ∏ H(t)^{s_i} = 1, where H(·) is a hash function employed as a random oracle and t is the time slot in which the calculation is carried out. This property is used by both schemes to compute the sum ∑ x_i, and both of them are proved to be semantically secure against chosen-plaintext attacks. However, both schemes rely on a trusted key issuer, and the data is not confidential against him. Furthermore, only product or sum calculation is enabled, which makes them less suitable for general data analysis.
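For concreteness, a toy version of this canceling-key aggregation, simplified from the constructions of [54], [40] (group parameters and sizes are illustrative):

```python
# A dealer hands out keys s_i summing to zero; each user publishes
# g^{x_i} * H(t)^{s_i}; the masks cancel in the product, and the aggregator
# recovers the (small) sum by brute-forcing the discrete log.
import hashlib
import random

p = 2**127 - 1                       # toy prime; real schemes use proper groups
g = 5
n = 3
keys = [random.randrange(p - 1) for _ in range(n - 1)]
keys.append(-sum(keys) % (p - 1))    # dealer forces sum(s_i) = 0 mod (p-1)

def H(t):
    return int.from_bytes(hashlib.sha256(str(t).encode()).digest(), 'big') % p

def publish(x_i, s_i, t):
    return pow(g, x_i, p) * pow(H(t), s_i, p) % p

xs, t = [7, 2, 4], 42
agg = 1
for x_i, s_i in zip(xs, keys):
    agg = agg * publish(x_i, s_i, t) % p   # H(t)^{s_i} factors cancel

total = next(v for v in range(100) if pow(g, v, p) == agg)
assert total == sum(xs)
```

The brute-force discrete log at the end is one reason such schemes only handle small aggregate sums, and hence fit sum-only aggregation rather than general polynomial analysis.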

Jung et al. [41], [42] removed the necessity of a trusted key dealer in their settings, and their scheme enables both product and sum calculations. They also mention the possibility of evaluating some multivariate polynomials, but all the product terms in the polynomials are revealed to the aggregator. More importantly, the hardness of breaking their scheme reduces only to the computational Diffie-Hellman (CDH) problem, which makes their scheme one-way but not semantically secure.

These four schemes based on secret-shared keys are close to our solution, but all of them need a fixed group to which a set of secret keys is assigned. This means that for a set of n users, one needs to let all users hold Θ(2^n) keys to conduct data analysis among any subgroup of them, which violates the flexibility requirement.

As a summary, the comparisons of all aforementioned approaches and ours are shown in Table 6.

TABLE 6
Comparisons

Approach based on        Storage Complexity   Communication Rounds   Accurate result?   Rely on trusted key dealer?
Multi-party Computation  High                 O(n)*                  Yes                No
Perturbation             O(1)                 O(1)                   No                 No
Functional Encryption    High                 O(1)                   Yes                Yes
Secret-shared Keys       O(2^n)               O(1)                   Yes                Yes, except [41], [42]
Ours                     O(n)                 O(1)                   Yes                No
* This is the best known complexity, and every gate in the circuit incurs one round of oblivious transfer.

8.5 Secure Multivariate Polynomial Evaluation

To the best of our knowledge, [9], [22], [31], [42] are the only four works that study secure multi-party computation for multivariate polynomial evaluation. First of all, the solution in [42] reveals all single product terms in the polynomial, which leaks much information about the private values. [31] focuses on polynomials of degree 3 and points out that the construction can be generalized to higher degrees; however, at higher degrees the communication overhead is not optimal. Subsequently, [22] presented a more efficient scheme with unicast complexity O(κ·m²·n·log d) and broadcast complexity O(κ·m³·(log d)²) (untight bounds, for simplicity), where κ is the security parameter, m is the number of product terms in the polynomial, n is the number of participants, and d is the highest degree of the polynomial.
Also, their number of communication rounds is constant only in terms of 'round-table' rounds, each of which is linear in the number of users. In our framework, the unicast complexity for each evaluation is only O(θ_min·m·κ) for the special user (θ_min being the minimum threshold for the number of participants) and O(m·κ) for the other participants, and there is no broadcast overhead. Also, the number of communication rounds is a constant irrespective of the number of participants in the evaluation. Besides, [9] also discusses how to compute limited families of polynomials, but its round complexity is O(l·log l), where l is the bit-length of the inputs.

9 CONCLUSION

In this paper, we designed PDA, a general framework which enables a third-party aggregator to perform many types of polynomial-based data analysis on private data records possessed by millions of users. Namely, our framework enables the aggregator to evaluate various multivariate polynomials over any subgroup of users' private data without learning individuals' input values, and it also supports new users joining and old users leaving the system. Our formal proof shows the semantic security of PDA against chosen-plaintext attacks, and our extensive evaluations show that PDA incurs constant overhead regardless of the total number of users or the complexity of the polynomial. Besides, PDA exhibits various nice properties (the design goals in Section 1) which make it suitable for real-life applications.

REFERENCES

[1] G. Ács and C. Castelluccia. I have a dream! (differentially private smart metering). In Information Hiding, pages 118–132. Springer, 2011.
[2] C. G. Akcora, B. Carminati, and E. Ferrari. Privacy in social networks: How risky is your social graph? In ICDE, pages 9–19. IEEE, 2012.
[3] M. Aliasgari, M. Blanton, Y. Zhang, and A. Steele. Secure computation on floating point numbers. In NDSS, 2013.
[4] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[5] M. Barbaro, T. Zeller, and S. Hansell. A face is exposed for AOL searcher no. 4417749. New York Times, 2006.
[6] A. Ben-David, N. Nisan, and B. Pinkas. FairplayMP: a system for secure multi-party computation. In CCS, pages 257–266. ACM, 2008.
[7] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.
[8] M. J. Berry and G. Linoff. Data mining techniques: for marketing, sales, and customer support. John Wiley & Sons, Inc., 1997.
[9] M. Blanton. Achieving full security in privacy-preserving data mining. In PASSAT and SocialCom, pages 925–934. IEEE, 2011.
[10] D. Boneh. The decision Diffie-Hellman problem. In Algorithmic Number Theory, pages 48–63. Springer, 1998.
[11] D. Boneh, A. Sahai, and B. Waters. Functional encryption: Definitions and challenges. In TCC, pages 253–273. Springer, 2011.
[12] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI, pages 43–52. Morgan Kaufmann Publishers, 1998.
[13] M. Burkhart, M. Strasser, D. Many, and X. Dimitropoulos. SEPIA: Privacy-preserving aggregation of multi-domain network events and statistics. In USENIX Security. USENIX, 2010.
[14] J. Camenisch and M. Michels. Proving in zero-knowledge that a number is the product of two safe primes. In EUROCRYPT, pages 107–122. Springer, 1999.
[15] B. Carminati, E. Ferrari, and M. Guglielmi. SHARE: Secure information sharing framework for emergency management. In ICDE, pages 1336–1339. IEEE, 2013.
[16] R. Chen, G. Acs, and C. Castelluccia. Differentially private sequential data publication via variable-length n-grams. In CCS, pages 638–649. ACM, 2012.
[17] S. G. Choi, K.-W. Hwang, J. Katz, T. Malkin, and D. Rubenstein. Secure multi-party computation of boolean circuits with applications to privacy in on-line marketplaces. In CT-RSA, pages 416–432. Springer, 2012.
[18] S. S. Chow, J.-H. Lee, and L. Subramanian. Two-party computation model for privacy-preserving queries over distributed databases. In NDSS, 2009.
[19] E. K. Clemons, S. Barnett, and A. Appadurai. The future of advertising and the value of social network websites: some preliminary examinations. In ICEC, pages 267–276. ACM, 2007.

[20] P. I. T. A. Committee et al. Revolutionizing health care through information technology. Arlington, VA, EOP of the U.S., 2004.
[21] J.-S. Coron. On the exact security of full domain hash. In CRYPTO, pages 229–235. Springer, 2000.
[22] D. Dachman-Soled, T. Malkin, M. Raykova, and M. Yung. Secure efficient multiparty computing of multivariate polynomials and applications. In ACNS, pages 130–146. Springer, 2011.
[23] I. Damgård, M. Geisler, M. Krøigaard, and J. B. Nielsen. Asynchronous multiparty computation: Theory and implementation. In PKC, pages 160–179. Springer, 2009.
[24] G. Danezis, C. Diaz, E. Käsper, and C. Troncoso. The wisdom of crowds: attacks and optimal constructions. In ESORICS, pages 406–423. Springer, 2009.
[25] G. Danezis, C. Fournet, M. Kohlweiss, and S. Zanella-Béguelin. Smart meter aggregation via secret-sharing. In SEGS, pages 75–80. ACM, 2013.
[26] G. Danezis and C. Troncoso. You cannot hide for long: de-anonymization of real-world dynamic behaviour. In WPES, pages 49–60. ACM, 2013.
[27] C. Dwork. Differential privacy: A survey of results. In TAMC, pages 1–19. Springer, 2008.
[28] L. Fan and L. Xiong. An adaptive approach to real-time aggregate monitoring with differential privacy. In TKDE. IEEE, 2013.
[29] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, pages 2360–2367. IEEE, 2010.
[30] J. Feigenbaum and M. Merritt. Distributed computing and cryptography: proceedings of a DIMACS Workshop, October 4-6, 1989, volume 2. American Mathematical Soc., 1991.
[31] M. Franklin and P. Mohassel. Efficient and secure evaluation of multivariate polynomials and applications. In ACNS, pages 236–254. Springer, 2010.
[32] M. Franz, B. Deiseroth, K. Hamacher, S. Jha, S. Katzenbeisser, and H. Schröder. Secure computations on non-integer values with applications to privacy-preserving sequence analysis. Information Security Technical Report, 17(3):117–128, 2013.
[33] B. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. CSUR, 42(4):14, 2010.
[34] D. Gillin. The Federal Trade Commission and internet privacy. Marketing Research, 12(3):39–41, 2000.
[35] S. Goldwasser and M. Bellare. Lecture notes on cryptography. Summer course "Cryptography and computer security" at MIT, 1996–1999.
[36] S. Goldwasser and S. Micali. Probabilistic encryption. JCSS, 28(2):270–299, 1984.
[37] G. Hans. Privacy policies, terms of service, and FTC enforcement: Broadening unfairness regulation for a new era. 2012.
[38] S. Ioannidis, A. Montanari, U. Weinsberg, S. Bhagat, N. Fawaz, and N. Taft. Privacy tradeoffs in predictive analytics. In SIGMETRICS. ACM, 2014.
[39] A. R. Jadad, R. B. Haynes, D. Hunt, and G. P. Browman. The internet and evidence-based decision-making: a needed synergy for efficient knowledge management in health care. CMAJ, 162(3):362–365, 2000.
[40] M. Joye and B. Libert. A scalable scheme for privacy-preserving aggregation of time-series data. In FC, pages 111–125. Springer, 2013.
[41] T. Jung and X.-Y. Li. Collusion-tolerable privacy-preserving sum and product calculation without secure channel. In TDSC. IEEE, 2014.
[42] T. Jung, X. Mao, X.-Y. Li, S.-J. Tang, W. Gong, and L. Zhang. Privacy-preserving data aggregation without secure channel: multivariate polynomial evaluation. In INFOCOM, pages 2634–2642. IEEE, 2013.
[43] R. Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In SIGKDD, pages 202–207, 1996.
[44] S. Laur, H. Lipmaa, and T. Mielikäinen. Cryptographically private support vector machines. In SIGKDD, pages 618–624. ACM, 2006.
[45] M. Li, S. Yu, Y. Zheng, K. Ren, and W. Lou. Scalable and secure sharing of personal health records in cloud computing using attribute-based encryption. TPDS, 24(1):131–143, 2013.
[46] N. Li, T. Li, and S. Venkatasubramanian. Closeness: A new privacy measure for data publishing. TKDE, 22(7):943–956, 2010.
[47] N. Li, W. Qardaji, D. Su, Y. Wu, and W. Yang. Membership privacy: a unifying framework for privacy definitions. In CCS, pages 889–900. ACM, 2013.
[48] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In RecSys, pages 165–172. ACM, 2013.
[49] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW, pages 897–908. IW3C2, 2013.
[50] A. Mohaisen and Y. Kim. Dynamix: anonymity on dynamic social structures. In ASIACCS, pages 167–172. ACM, 2013.
[51] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In S&P, pages 334–348. IEEE, 2013.
[52] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, pages 223–238. Springer, 1999.
[53] J. Platt et al. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.
[54] E. Shi, T.-H. H. Chan, E. G. Rieffel, R. Chow, and D. Song. Privacy-preserving aggregation of time-series data. In NDSS, 2011.
[55] http://stattrek.com/statistics/formulas.aspx.
[56] J. Xie, B. P. Knijnenburg, and H. Jin. Location sharing privacy preference: analysis and personalized recommendation. In IUI, pages 189–198. ACM, 2014.
[57] A. C. Yao. Protocols for secure computations. In FOCS, pages 160–164, 1982.

Taeho Jung received the B.E. degree in Computer Software from Tsinghua University, Beijing, in 2011, and is working toward the Ph.D. degree in Computer Science at Illinois Institute of Technology under the supervision of Dr. Xiang-Yang Li. His research area, in general, includes privacy and security issues in big data analysis and networking applications. Specifically, he is currently working on privacy-preserving computation in various applications and scenarios where big data is involved.

Junze Han has been a Ph.D. student in Computer Science at Illinois Institute of Technology since 2011. He received the B.E. degree from the Department of Computer Science at Nankai University, Tianjin, in 2011. His research interests include data privacy, mobile computing, and wireless networking.

Xiang-Yang Li is a professor at the Illinois Institute of Technology. He holds an EMC-Endowed Visiting Chair Professorship at Tsinghua University and currently is a distinguished visiting professor at Xi'an Jiaotong University and the University of Science and Technology of China. He is a recipient of the China NSF Outstanding Overseas Young Researcher (B) award. Dr. Li received the MS (2000) and PhD (2001) degrees from the Department of Computer Science at the University of Illinois at Urbana-Champaign, and Bachelor degrees from the Department of Computer Science and the Department of Business Management at Tsinghua University, P.R. China, both in 1995. His research interests include wireless networking, mobile computing, security and privacy, cyber-physical systems, and algorithms. He and his students won three best paper awards (ACM MobiCom 2014, COCOON 2001, IEEE HICSS 2001) and one best demo award (ACM MobiCom 2012), and he was twice selected as a best paper candidate (ACM MobiCom 2008, ACM MobiCom 2005). He published the monograph "Wireless Ad Hoc and Sensor Networks: Theory and Applications" and co-edited several books, including "Encyclopedia of Algorithms". Dr. Li is an editor of several journals, including IEEE Transactions on Mobile Computing. He has served many international conferences in various capacities, including ACM MobiCom, ACM MobiHoc, and IEEE MASS. He is an IEEE Fellow and an ACM Distinguished Scientist.