Private Hierarchical Clustering and Efficient Approximation

Xianrui Meng (Amazon Web Services), Dimitrios Papadopoulos (Hong Kong University of Science and Technology), Alina Oprea (Northeastern University), Nikos Triandopoulos (Stevens Institute of Technology)

ABSTRACT especially so in , where model inference uses In collaborative learning, multiple parties contribute their datasets unlabelled observations (evading points of saturation, encountered to jointly deduce global models for numerous in , where new training sets improve accuracy predictive tasks. Despite its efficacy, this learning paradigm fails to only marginally). Also, the more varied the collected data, the more encompass critical application domains that involve highly sensi- elaborate its analysis, as degradation due to noise reduces and do- tive data, such as healthcare and security analytics, where privacy main coverage increases. Indeed, for a given learning objective, say risks limit entities to individually train models using only their own classification or , combining more datasets of datasets. In this work, we target privacy-preserving collaborative hi- similar type but different origin enables discovery of more complex, erarchical clustering. We introduce a formal security definition that interesting, hidden structures and of richer association rules (corre- aims to achieve balance between utility and privacy and present a lation or causality) among attributes. So, ML models improve their two-party protocol that provably satisfies it. We then extend our predictive power when they are built over multiple datasets owned protocol with: (i) an optimized version for single-linkage cluster- and contributed by different entities, in what is termed collaborative ing, and (ii) scalable approximation variants. We implement all our learning—and widely considered as the golden standard [100]. schemes and experimentally evaluate their performance and accu- Privacy-preserving hierarchical clustering. Several learning racy on synthetic and real datasets, obtaining very encouraging tasks of interest, across a variety of application domains, such as results. 
For example, end-to-end execution of our secure approxi- healthcare or security analytics, demand deriving accurate ML mod- mate protocol for over 1M 10-dimensional data samples requires els over highly sensitive data—e.g., personal, proprietary, customer, 35sec of computation and achieves 97.09% accuracy. or other types of data that induce liability risks. By default, since collaborative learning inherently implies some form of data sharing, CCS CONCEPTS entities in possession of such confidential datasets are left with no • Security and privacy → Cryptography; • Computing other option than simply running their own local models, severely methodologies → Unsupervised learning. impacting the efficacy of the learning task at hand. Thus, privacy risks are the main impediment to collaboratively learning richer KEYWORDS models over large volumes of varied, individually contributed, data. The security and ML community has embraced the concept of secure computation; private hierarchical clustering; secure approx- Privacy-preserving Collaborative Learning (PCL), the premise being imation that effective analytics over sensitive data is feasible by building global models in ways that protect privacy. This is closely related to 1 INTRODUCTION (privacy-preserving) ML-as-a-Service [42, 52, 53, 104] that utilizes Big-data analytics is an ubiquitous practice with a noticeable im- cloud providers for ML tasks, without parties revealing their sensi- pact on our lives. Our digital interactions produce massive amounts tive raw data (e.g., using encrypted or sanitized data. Existing work of data that are analyzed in order to discover unknown patterns on PCL mostly focuses on supervised rather than unsupervised arXiv:1904.04475v3 [cs.CR] 27 Sep 2021 or correlations, which help us draw safer conclusions or make in- learning tasks (with a few exceptions such as 푘-means clustering). formed decisions. 
At the core of this lies Machine Learning (ML) for As unsupervised learning is a prevalent paradigm, the design of devising complex data models and predictive algorithms that pro- ML protocols that promote collaboration and privacy is vital. vide hidden insights or automated actions, while optimizing certain In this paper, we study the problem of privacy-preserving hierar- objectives. Example applications that successfully employ ML are chical clustering. This unsupervised learning method groups data market forecast, service personalization, speech/face recognition, points into similarity clusters, using some well-defined distance autonomous driving, health diagnostics and security analytics. metric. The “hierarchic” part is because each data point starts as Of course, data analysis is only as good as the analyzed data, but a separate “singleton” cluster and clusters are iteratively merged this goes beyond the need to properly inspect, cleanse or transform building increasingly larger clusters. This process forms a natural high-fidelity data prior to its modeling: In most learning domains, hierarchy of clusters that is part of the output, showing how the analyzing “big data” is of twofold semantics: volume and variety. final clustering was produced. We present scalable cryptographic First, the larger the dataset available to an ML algorithm, the bet- protocols that allow two parties to privately learn a model for the ter its learning accuracy, as irregularities due to fade away joint clusters of their combined datasets. Importantly, we propose a faster. Indeed, scalability to large dataset sizes is very important, formal security definition for this task in the MPC framework and prove our protocols satisfy it. In contrast, prior works for privacy- hierarchical clustering already includes the (now partitioned) in- preserving hierarchical clustering have proposed crypto-assisted put, this problem cannot directly benefit from MPC. 
This issue is protocols but without offering rigorous security definitions or anal- partially the reason why previous approaches for hierarchical clus- ysis (e.g., [27, 55, 57]; see detailed discussion in Section 8). tering (see discussion in Section 8 and an excellent survey of related Motivating applications. Hierarchical clustering is a class of un- work by Hegde et al. [49]) lack formal security analysis or have supervised learning methods that build a hierarchy of clusters over significant information leakage. To overcome this, our approach is an input dataset, typically in bottom-up fashion. Clusters are initial- to modify and refine what private hierarchical clustering should ized to each contain a single input point and are iteratively merged produce, redacting the joint output—sufficiently enough, to allow in pairs, according to a linkage metric that measures clusters’ close- the needed input privacy protection, but minimally so, to preserve ness based on their contained points. Here, unlike other clustering the learning utility. We introduce a security notion that is based on methods (푘-means or spectral clustering), different distance metrics point-agnostic dendrograms, which explicitly capture only the merg- can define cluster linkage (e.g., nearest neighbor and diameter for ing history of formed joint clusters and useful statistics thereof, to single and complete linkage, respectively) and flexible conditions on balance the intended accuracy against the achieved privacy. To the these metrics can determine when merging ends. The final output best of our knowledge, our formal security definition (Section 3) is is a dendrogram with all formed clusters and their merging history. the first such attempt for the case of hierarchical clustering. This richer clustering type is widely used in practice, often in areas The next challenge is to securely realize this functionality ef- where the need for scalable PCL solutions is profound. ficiently. 
Standard tools for secure two-party computation, e.g., In healthcare, for instance, hierarchical clustering allows re- garbled circuits [113, 114], result in large communication, while searchers, clinicians and policy makers to process medical data fully homomorphic encryption [41] is still rather impractical, so and discover useful correlations to improve health practices—e.g., designing scalable hierarchical clustering PCL protocols is chal- discover similar genes types [34], patient profiles most in need of lenging. Moreover, hierarchical clustering of 푛 points is already 3 targeted intervention [80, 110] or changes in healthcare costs for computation-heavy—of O(푛 ) cost. As such, approximation algo- specific treatments [68]. To be of any predictive value, such data rithms, e.g., CURE [47], are the de facto means to scale to massive contains sensitive information (e.g., patient records, gene informa- datasets, but incorporating approximation to private computation tion, or PII) that must be protected, also due to legislations such as is not trivial—as complications often arise in defining security [37]. HIPPA in US or GDPR in EU. Also, in security analytics, hierarchi- In Section 4, we follow a modular design approach and use cryp- cal clustering allows enterprise security personnel to process log tography judiciously by devising our main construction as a mixed data on network/users activity to discover suspicious or malicious protocol (e.g., [28, 50, 63]). We decompose our refined hierarchical events—e.g., detect botnets [46], malicious traffic79 [ ], compromised clustering into building blocks and then we select a combination of accounts [19], or malware [13]. Again, such data contains sensi- tools that achieves fast computation and low bandwidth usage. In tive information (e.g., employee/customer data, enterprise security particular, we conveniently use garbled circuits for cluster merging, posture, defense practices, etc.) 
that must be protected, also due but additive homomorphic encryption [86] for cluster encoding, to industrial regulations or for reduced liability. As such, without while securely “connecting” the two steps’ outputs. privacy provisions for joint , entities are restricted In Section 5, we evaluate the performance and security of our 2 to learn only local clusters, thus confined in accuracy and effective- main protocol and present an optimized variant of 푂 (푛 ) cost for ness. E.g., a clinical-trial analysis over patients of one hospital may single linkage. In Section 6, we integrate the CURE method [47] introduce bias on geographic population, or network inspection of for approximate clustering into our design, to get the best-of-two- one enterprise may miss crucial insight from attacks against others. worlds quality of high scalability and privacy. We study different In contrast, our treatment of clustering as a PCL instance is a secure approximate variants that exhibit trade-offs between effi- solid step towards richer classification. Our protocols for private hi- ciency and accuracy without extra leakage due to approximation. erarchical clustering incentivize entities to contribute their private In Section 7, we report results from the experimental evaluation of datasets for joint cluster analysis over larger and more varied data our protocols on synthetic and real data that confirm their practical- collections, to get in return more refined results. For instance, hospi- ity. For example, end-to-end execution of our private approximate tals can jointly cluster medical data extracted from their combined single-linkage protocol for 1M 10 − 푑 records, achieves 97.09% ac- patient records, to provide better treatment, and enterprises can curacy at very with only 35sec of computation time. jointly cluster threat indicators collected from their combined SIEM Summary of contributions. 
Overall, in this work our results can tools, to present timely and stronger defenses against attacks.1 At be summarized as follows: all times, data owners protect the confidentiality of their private • We provide a formal definition and secure two-party protocols data and remain compliant with current regulations. for private hierarchical clustering for single or complete linkage. Challenges and insights. A first challenge we faced is how to • We present an optimized protocol for single linkage that signifi- rigorously specify the secure functionality that such protocols must cantly improves the computational and communication costs. achieve. A secure protocol guarantees that no party learns any- • We combine approximate clustering methods with our protocols thing about the input of the other party, except what can be inferred to get variants that achieve both scalability and strong privacy. after parties learn the output. But since the output dendrogram of • We experimentally evaluate the performance of our protocols via a prototype implementation over synthetic and real datasets. 1In line with current trends toward collaborative learning in healthcare/security analyt- ics; e.g., AI-based clinical-trial predictions [1], threat-intelligence sharing [7, 8, 29, 36]. 2 PRELIMINARIES Hierarchical Clustering Algorithm HCAlg D = { }푛 Hierarchical clustering (HC). For fixed positive integers 푑, 푙, let Input: Indexed set v푖 푖=1, termination parameter 푡 D = {푣 |푣 ∈ Z푑 }푛 be an unlabeled indexed dataset of 푛 푑- Output: Dendrogram 푇 , clusters 퐶 (푇 ), metadata 푀 (푇 ) 푖 푖 2푙 푖=1 Parameters: Linkage distance 훿 (·, ·), termination condition End(·, ·), dimensional points, where w.l.o.g, we set the domain to {0,..., 2푙 − cluster statistics set 푀 ⊇ {푟푒푝 (·), 푠푖푧푒 (·)} 1}. Over pairs 푥,푦 ∈ D of points, point distance is measured [Initially, at level 푛] using the standard square Euclidean distance metric dist(푥,푦) = Í푑 2 1. Initialize dendrogram 푇 : For each 푖 = 1, . . . 
, 푛: (푥 푗 − 푦푗 ) . Over pairs 푋, 푌 ⊆ D of sets of points, set closeness 푗=1 – Create node 푢푖 as the 푖th left-most leaf in 푇 . is measured using a linkage distance metric 훿 (푋, 푌), as a function of – Set 퐶 (푢푖 ) = {v푖 } as the singleton cluster of 푢푖 . the cross-set distances of points contained in 푋, 푌 . The most com- – Compute 푀 (푢푖 ) = {푚(v푖 ) |푚 ∈ 푀 } as statistics of 푢푖 . monly used linkage distances are the single linkage (or nearest neigh- 2. Set up linkages: Compute linkages of all pairs of bor) defined as 훿 (푋, 푌) = min푥 ∈푋,푦∈푌 dist(푥,푦), and the complete singleton clusters as a dictionary 퐷, where {퐶 (푢푖 ),퐶 (푢푗 )} linkage (or diameter) defined as 훿 (푋, 푌) = max푥 ∈푋,푦∈푌 dist(푥,푦). is keyed under 훿 (퐶 (푢푖 ),퐶 (푢푗 )), 1 ≤ 푖 < 푗 ≤ 푛. Standard agglomerative HC methods use set closeness to form [Iteratively, at level 푖 = 푛, . . . , ℓ푡 + 1] clusters in a bottom-up fashion, as described in algorithm HCAlg 1. Update 푇 : If 푁푖 is the set of nodes in 푇 at level 푖: ′ (Figure 1). It receives an 푛-point dataset D and groups its points – Find in 퐷 the min-linkage pair (푢,푢 ) of nodes in 푁푖 , into a total of ℓ푡 ≤ 푛 target clusters, by iteratively merging pairs breaking ties using a fixed rule over leaf-node indices. ′ of closest clusters into their union. The merging history is stored – Create node 푤 ∈ 푁푖−1 as parent of 푢 and 푢 ; set ′ ′ (redundantly) in a dendrogram 푇 , that is, a forest of clusters of 퐶 (푤) = 퐶 (푢) ∪ 퐶 (푢 ); for each node 푢¯ ∈ 푁푖 − {푢,푢 }, create node 푤¯ ∈ 푁 as parent of 푢¯; set 퐶 (푤¯ ) = 퐶 (푢¯). 푛 − ℓ푡 + 1 levels, where siblings correspond to merged clusters and 푖−1 ∈ ( ) levels to dataset partitions, build level-by-level as follows: – For each node 푤ˆ 푁푖−1, compute 푀 푤ˆ . 2. Check termination: If End(푇, 푡) == 1, terminate. • Initially, each input point 푣푖 ∈ D forms a singleton cluster {푣푖 } as a leaf in 푇 (at its lowest level 푛). 3. Update linkages: Compute linkage 훿 (퐶 (푤),퐶 (푤¯ )), for all 푤¯ ∈ 푁 − − {푤 }, and consistently update dictionary 퐷. 
• Iteratively, in 푛 − ℓ푡 clustering rounds, the 푖 root clusters (at top 푖 1 level 푖) form 푖 − 1 new root clusters in 푇 (at higher level 푖 − 1), Figure 1: Agglomerative hierarchical clustering. with the closest two merged into a union cluster as their parent, result 푦. In this context, we study privacy-preserving hierarchical and each other cluster copied to level 푖 − 1 as its parent. clustering in the semi-honest adversarial model which assumes that When a new level of ℓ target clusters is reached, HCAlg halts and 푡 parties are honest, but curious: They will follow the prescribed pro- outputs 푇 . The exact value of ℓ ∈ [1 : 푛] is determined during 푡 tocol but also seek to infer information about the input of the other execution via a predefined condition End checked over the current party, by examining the transcript of exchanged messages—the state 푇 and a termination parameter 푡 provided as additional input. latter, assumed to be transferred over a reliable channel. This allows for flexible termination conditions—e.g., stopping when Although, in practice, parties may choose to be malicious, devi- inter-cluster distance drops below an threshold specified by 푡, or ating from the prescribed protocol if they can benefit from this and simply when exactly ℓ = 푡 target clusters are formed. 푡 can avoid detection, the semi-honest adversarial model still has its Typically, the dendrogram 푇 is augmented to store some associ- merits, especially in the studied PCL setting. Namely, it provides ated cluster metadata, by keeping, after any union/copy cluster is essential privacy protection for any privacy-aware party to enter formed, some useful statistics over its contained points. Common the joint computation to benefit from collaborative learning. 
We such statistics for cluster 퐶 is its size 푠푖푧푒(퐶) = |퐶| and representa- note that, by trading off efficiency, security can be hardened via tive value 푟푒푝(퐶), usually defined as its centroid (i.e., a certain type known generic techniques for compiling protocols secure in this of average) point. Overall, for a set 푀 of cluster statistics of interest model into counterparts secure against malicious parties. and specified linkage distance and termination condition, HCAlg is D { }푛 Garbled circuits. One of the most widely used tools for two-party viewed to operate on indexed dataset = 푣푖 푖=1 and return an 푀-augmented dendrogram 푇 , comprised of: (1) the forest structure of secure computation, Garbled Circuits (GC) [113, 114] allow two dendrogram 푇 , specifying the full merging history of input points parties to evaluate a boolean circuit on their joint data without into formed clusters (from 푛 singletons to ℓ푡 target ones); (2) the revealing their respective inputs. This is done by generating an cluster set 퐶(푇 ); and (3) the metadata set 푀 (푇 ) associated with encrypted truth table for each gate while evaluating the circuit by (clusters in) 푇 . Assuming that HCAlg employs a fixed tie-breaking decrypting these tables in a way that preserves input privacy. In method in merging clusters, its execution is deterministic. Appendix A, we provide more details about the GC framework. Secure computation and threat model. We consider the stan- Homomorphic encryption. This technique allows carrying out dard setting for private two-party computation, where two parties operations over encrypted data. Fully Homomorphic Encryption wishing to evaluate function 푓 (·, ·) on their individual, private in- (FHE) [41] can evaluate arbitrary functions over ciphertexts, but puts 푥1, 푥2, engage in an interactive cryptographic protocol that remains rather impractical. Partially homomorphic encryption sup- upon termination returns to them the common output 푦 = 푓 (푥1, 푥2). 
ports only specific arithmetic operations over ciphertexts, but al- Protocol security has this semantics: Subject to certain computa- lows for very efficient implementations86 [ , 91]. We use Paillier’s tional assumptions and misbehavior types during protocol execu- scheme for Additively Homomorphic Encryption (AHE) [86], sum- tion, no party learns anything about the input of the other party, marized as follows. For security parameter 휆, keys generated by 휆 other than what can be inferred by its own input 푥푖 and the learned running (pk, sk) ← Gen(1 ) and a public RSA modulus 푁 , the ∗ (· ·) scheme encrypts (with public key pk) any message 푚 in the plain- Ideal Functionality 푓퐻퐶 , Z [ ] { }푛1 { }푛2 text space 푁 into a ciphertext 푚 , ensuring that decryption (with Input: Sets 푃 = 푝푖 1 , 푄 = 푞푗 1 ′ 2 secret key 푠푘) of any ciphertext product [푚]·[푚 ] mod 푁 (com- Output: Dendrogram 푇 ∗, metadata 푀∗ ⊇ {푟푒푝 (·), 푠푖푧푒 (·)} ′ putable without sk) results in the plaintext sum 푚 + 푚 mod 푁 . Parameters: Linkage distance 훿 (·, ·), termination condition End(·, 푡), Thus, decrypting [푚]푘 mod 푁 2 results in 푘푚 mod 푁 , and the cluster statistics set 푀, selection function S(·) ciphertext product [푚]·[0] results in a fresh encryption of 푚. [Pre-process] Form input of size 푛 = 푛1 + 푛2 for HCAlg: D = { }푛 = ≤ = 1. Set 푑푘 1 s.t. 푑푘 푝푘 , if 푘 푛1, or else 푑푘 푞푘−푛1 . 3 FORMAL PROBLEM SPECIFICATION 2. Pick random permutation 휋 : [푛] → [푛]; set D∗ = 휋 (D). We introduce a model for studying private hierarchical clustering, [HC-process] Run HCAlg(D∗, 푡) w/ parameters 훿, 푀, End. the first to provide formal specifications for secure two-party pro- [Post-process] Redact output 푇 ∗, 퐶 (푇 ∗), 푀 (푇 ∗) of HCAlg: tocols for this central PCL problem. Importantly, we define security 1. Set 푀∗ = ∅; ∀푣 ∈ 푇 ∗: if S(푣) == 1, 푀∗ ← 푀∗ ∪ {푀 (푣)}. for a refined learning task that achieves a meaningful balance be- 2. Return 푇 ∗, 푀∗. 
tween the intended accuracy and privacy—a necessary compromise ∗ for the problem at hand to even be defined as a PCL instance! Figure 2: Ideal functionality 푓퐻퐶 for hierarchical clustering. We first formulate two-party privacy-preserving hierarchical metadata 푀(푇 ). But set 퐶(푇 ) itself reveals the input D; in this case, clustering as a secure computation. Parties P1, P2 hold indepen- the price for collaborative learning is full disclosure of sensitive dently owned datasets 푃, 푄 of points in Z푑 , and wish to per- 2푙 data and nothing is to be protected! This raises the question of form a collaborative hierarchical clustering over the combined set limiting exactly what information about 푃, 푄 should be revealed by D = 푃 ∪ 푄 exact specification 푓 . They agree on the 퐻퐶 of this learn- 푓퐻퐶 which is the focus of the remainder of this section. ing task, as a function of their individually contributed datasets that encompasses all other parameters (e.g., for termination). Refined cluster analysis. In the PCL setting, we need a new defi- nition of hierarchical clustering that distills the full augmented den- Let Π be a two-party protocol that correctly realizes 푓퐻퐶 (·, ·): Run drogram {푇,퐶(푇 ), 푀(푇 )} into a redacted, but still useful, learned jointly on inputs 푥1, 푥2, Π returns the common output 푓퐻퐶 (푥1, 푥2). model, balancing between accuracy (to benefit from clustering) Thus, parties P1, P2 can learn cluster model 푓퐻퐶 (푃, 푄) by running protocol Π on their inputs 푃, 푄. As discussed, Π is considered to be and privacy (to allow collaboration). If allowing the ideal function- ( ) secure if its execution prevents an honest-but-curious party from ality 푓퐻퐶 to return 퐶 푇 is one extreme that diminishes privacy, learning anything about the other party’s input that is not implied removing the dendrogram 푇 from the output—to learn only about ( ) ( ) by the learned output. 
We formalize this intuitive privacy require- its associated information 퐶 푇 , 푀 푇 —is another that diminishes ment via the standard two-party ideal/real world paradigm [45]. accuracy. Indeed, if 푇 , which captures the full merging history in its structure, is excluded from the output of 푓 , a core feature in HC Ideal functionality. First, we define what one can best hope for. 퐻퐶 is lost: the ability to gain insights on how target clusters were formed, Cluster analysis with perfect privacy is trivial in an ideal world, under what hierarchies and in which order. This renders the HC where P , P instantly hand-in their inputs 푥 , 푥 to a trusted third 1 2 1 2 analysis only as good as much simpler techniques (e.g., 푘-means) party, called the ideal functionality 푓 , that computes and an- 퐻퐶 that merely discover pure similarity statistics of target clusters. nounces 푓 (푥 , 푥 ) (and explodes). Here, the use of terms “perfect” 퐻퐶 1 2 As the motivation for studying collaborative HC as a prominent and “ideal” is fully justified for no information about any private and widely used unsupervised learning task, in the first place, lies input is leaked during the computation. Some information about 푥 1 exactly on its ability to discover such rich inter-cluster relations, or 푥 may be inferred after the output is announced, by combining 2 we must keep the forest structure of 푇 in 푓 ’s output.3 the known 푥 or 푥 with the learned 푓 (푥 , 푥 ): It is the inherent 퐻퐶 2 1 퐻퐶 1 2 Avoiding the above two degenerate extremes suggests that the price for collaboratively learning a non-trivial function. learned model 푓 (푃, 푄) should necessarily include the cluster In the real world, P , P learn 푓 (푥 , 푥 ) by interacting in the 퐻퐶 1 2 퐻퐶 1 2 hierarchy 푇 but not the clusters 퐶(푇 ) themselves. Yet, the obvious joint execution of a protocol Π. 
We measure the privacy quality of middle-point approach of learning model 푓 푚 (푃, 푄) = {푇, 푀(푇 )} Π against the ideal-world perfect privacy, dictating that running 퐻퐶 remains suboptimal in terms of privacy protections, as the learned Π is effectively equivalent to calling the ideal functionality 푓 . 퐻퐶 output can still be strongly correlated to exact input points. Indeed, Informally, Π securely realizes 푓 , if anything computable by an 퐻퐶 given푇 and a party’s own input, inferring points of the other party’s efficient semi-honest party P in the real world, can be simulated i input simply amounts to identifying singleton clusters, which is by an efficient algorithm (called the simulator Sim), acting as P in i generally possible by inspecting and correlating the (hard-coded in the ideal world; i.e., Π leaks no information about a private input HCAlg) indices in D with the metadata associated to singletons (or during execution, subject to the price for learning 푓 (푥 , 푥 ). 퐻퐶 1 2 their close neighbors). For instance, if 푤 is the parent of singleton 푢 Next comes the question of which ideal functionality 푓퐻퐶 should ′ and cluster 푢 in 푇 , then P1 can infer input point 퐶(푢) of P2, either Π securely realize for private joint hierarchical clustering? Though directly from output 푀 (푢), if 푢 is known to store none of its input tempting, equating 푓 with the legacy algorithm HCAlg (Figure 1), 퐻퐶 points, or indirectly from 푀 (푢′), 푀 (푤), if these output values imply thus learning a full-form augmented dendrogram, slides us into a value of 푀(푢) that is consistent with none of its own inputs. a degeneracy. Assume 푓퐻퐶 merely runs HCAlg on the combined D = 푃 ∪푄 = {푑 }푛 푛 = |푃 | + |푄| 2 indexed set 푘 푘=1, . 
The learned model is the dendrogram 푇 along with its associated clusters 퐶(푇 ) and 3Cluster hierarchy is vital in HC learning, e.g., in healthcare, revealing useful causal factors that contribute to prevalence of diseases [34] and in biology, revealing useful 2If 푃, 푄 are indexed, then D = 푄 ∥푃, or else a fixed ordering is used. relationships among plants, animals and their habitat ecological subsystems [44]. Also, even without singleton clusters in the output, there is still input points. For instance, the applied permutation eliminates leak- leakage from the positioning of the points at the leaf level of 푇 . age from the positioning of the singleton cluster at the leaves that, E.g., assuming 푃, 푄 are ordered from left to right, a merging of two in our previous example, allowed one to infer whether the other points at the right half of the tree during the first merge reveals party owned points with smaller distance than its own pairs, from to P1 that P2 has a pair of points with smaller distance than the the first-round clustering result. More generally, anything inferable minimum distance observed among points in 푃. Hence, it is crucial about a party’s private input relates to a meta-analysis that must to eliminate information about the positioning of clusters in 푇 . necessarily encompass the (unknown) input distribution and the ∗ random permutation used by 푓퐻퐶 . This can be viewed as an inherent price of collaborative hierarchical clustering. The following defines Point-agnostic dendrogram. Such considerations naturally lead the security of privacy-preserving hierarchical clustering. to a new goal: We seek to refine further, but minimally so, the Definition 3.1. 
We refine the middle-point model f^m_HC(P, Q) = {T, M(T)} into an optimized model f*_HC(P, Q) = {T*, M*(T*)}, whereby no private input points directly leak to any of the parties after the output is announced. This quality is well-defined, intuitive and useful: unless the intended joint hierarchical clustering explicitly copies some of the input points to the output, the learned model f*_HC(P, Q) should allow no party to explicitly learn, that is, to deterministically deduce with certainty, any of the unknown input points of the other party.

We accordingly set our ideal functionality f*_HC for hierarchical clustering to output a point-agnostic augmented dendrogram, defined by merely running algorithm HCAlg, subject to a twofold correction of its input P, Q and returned dendrogram (Figure 2):
• Pre-process input: Run HCAlg on an indexed set D* that is a random permutation of the combined set D = P ∪ Q = {d_k}, k = 1, ..., n.
• Post-process output: Return the output T*, C(T*), M(T*) of HCAlg redacted as T*, M*(T*) ⊂ M(T*), including metadata of only a few safe clusters in T*.

Our ideal functionality f*_HC refines the ordinary dendrogram T, C(T), M(T): running HCAlg on the randomly permuted input D* (instead of D) results in a new randomized forest structure T* (instead of T) and, although its associated sets of formed clusters C(T*) and metadata M(T*) remain the same, the learned model includes no elements from C(T*), but only specific elements from M(T*), determined by a selection function S(·) (a parameter agreed upon among the parties and hard-coded in f*_HC). Such metadata is safe to learn, in the sense that it does not directly leak any input points.

We propose the following two orthogonal strategies for safe metadata selection for point-agnostic dendrograms:
• s-Merging selection: M(w) ∈ M*(T*) if w is the parent of u, u′ in T* and |C(u)|, |C(u′)| > s: any non-singleton cluster formed by merging two clusters of size above threshold s > 0 is safe;
• Target selection: M(w) ∈ M*(T*) if w is a root in T*: any target cluster at level ℓ_t in T* is safe.

Above, the first strategy ensures that no direct leakage of private input points occurs by correlating statistics of thin neighboring clusters; in particular, no cluster statistics are learned for singletons or their parents (s = 1), thus eliminating the type of leakage allowed by model f^m_HC(P, Q). The second strategy ensures that only statistics of target clusters are learned, that is, input points may be directly learned only explicitly as part of the intended cluster analysis. Overall, the resulting dendrogram is point-agnostic in the sense that neither the forest structure of T* nor the metadata M*(T*) reveal which singletons a party's points are mapped to. As points are randomly mapped to singletons, ties in cluster merging are randomly broken, and no statistics are learned for singleton (or thin) clusters, no party can deduce with certainty any of the other party's input points.

A two-party protocol Π, jointly run by P1, P2 on respective inputs x_1, x_2 using individual random tapes r_1, r_2 that result in incoming-message transcripts t_1, t_2, is said to be secure for collaborative privacy-preserving hierarchical clustering in the presence of static, semi-honest adversaries, if it securely realizes the ideal functionality f*_HC defined in Figure 2, by satisfying the following: for i = 1, 2 and for any security parameter λ, there exists a non-uniform probabilistic polynomial-time simulator Sim_Pi so that

    {Sim_Pi(1^λ, x_i, f*_HC(x_1, x_2))} ≈ {view^Π_Pi} ≜ {r_i, t_i}.

4 MAIN CONSTRUCTION
We now present our main construction, protocol PHC for Private Hierarchical Clustering, which securely realizes the ideal functionality f*_HC (of Figure 2) when jointly run by parties P1, P2.

General approach. As discussed earlier, for efficiency reasons, we seek to avoid carrying out hierarchical clustering, a complex and inherently iterative process of cubic cost, in its entirety by computing over ciphertexts (e.g., via GC or FHE). Instead, we adopt a mixed-protocols design, decomposing hierarchical clustering into more elementary tasks. We then use tailored secure and efficient protocols for each task, and combine these components into a final protocol in ways that minimize the cost of converting data encodings between individual sub-protocols. Hence, our solution is a secure mixed-protocol specifically tailored for hierarchical clustering.

It is worth noting that generic solutions from two-party computation (2PC) (e.g., [28]) would solve the problem but would not easily scale to large datasets. During hierarchical clustering, we need to maintain a distance matrix between the two parties with space complexity O(n^2). If one relied solely on a single generic approach such as GC or secret sharing, the communication bandwidth would become the bottleneck. Hence, using additively homomorphic encryption during our protocol's setup phase to produce a "shared permuted" distance matrix allows us not only to hide the correspondence between Euclidean distances and original points, but also to be more communication-efficient overall. Another advantage compared to other 2PC techniques is that our approach can achieve better precision, as we explain in more detail in Section 7.

Our protocol securely implements f*_HC for the configuration that the parties specify: linkage δ(·, ·), termination condition End(·, t), cluster statistics set M, and selection function S(·). Yet for simplicity, we hereby use the following default configuration, where:
(1) complete linkage over one-dimensional data is used;
(2) the termination condition results in t target nodes;
(3) target selection is used for safe metadata selection; and
(4) only representative values and size statistics are learned.
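As a plaintext point of reference for the ideal functionality under this default configuration (no cryptography; the helper name hc_ideal and the seed parameter are illustrative, not part of the protocol), the following sketch permutes the combined input, runs complete-linkage agglomerative clustering on one-dimensional points until t target clusters remain, and releases only the safe metadata: per-cluster representatives (averages) and sizes.

```python
import random

def hc_ideal(P, Q, t, seed=0):
    """Plaintext reference for f*_HC under the default configuration:
    permute D = P u Q, run complete-linkage clustering to t clusters,
    and redact the output to (representative, size) pairs only."""
    rng = random.Random(seed)
    D = list(P) + list(Q)
    rng.shuffle(D)                      # pre-process: random permutation of D
    clusters = [[x] for x in D]         # start from (permuted) singletons
    while len(clusters) > t:            # n - t cluster-merging rounds
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: largest pairwise inter-cluster distance
                d = max(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # post-process: release only safe metadata per target cluster
    return [(sum(c) / len(c), len(c)) for c in clusters]
```

For example, hc_ideal([1.0, 1.2, 9.0], [1.1, 9.2, 9.4], t=2) yields the averages and sizes of the two well-separated groups, regardless of the permutation used.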
Algorithm 1: PHC: Private Cluster Analysis
  P1's Input: P = {p_1, ..., p_n1}, security parameter λ
  P2's Input: Q = {q_1, ..., q_n2}, security parameter λ
  Output: Merging history, {rep(·), size(·)} of t target nodes
  Parameters: Default configurations
  1  P1: Generate (pk, sk) ← Gen(1^λ); send pk to P2
  2  P2: Generate (pk′, sk′) ← Gen(1^λ); send pk′ to P1
  3  P1, P2: Jointly run PHC.Setup, PHC.Cluster, PHC.Output

  λ, κ                    security and statistical parameters
  t, ℓ_t                  termination parameter, # of target clusters
  (pk, sk), (pk′, sk′)    public and secret keys of parties P1, P2
  [d], ⟦d⟧                AHE-encrypted plaintext d under pk, pk′
  ⟨c⟩                     AHE-decrypted ciphertext c
  Σ                       n × n matrices R, B stored by P1, P2
  (O1; O2) ← GC(I1; I2)   P1, P2 run GC on I1, I2 to get O1, O2
  π1, π2                  permutations contributed by P1, P2
  dist(p, q)              square Euclidean distance of p, q
  rep(·), size(·)         representatives and sizes of clusters

Figure 4: Basic notation used in our protocol PHC.

That is, by (2)-(4) in what follows (and in our experiments in Section 7), the set of redacted statistics M* consists of the representatives rep_1, ..., rep_{ℓt} and sizes size_1, ..., size_{ℓt} of the ℓ_t = t target clusters (recall that representatives are a predefined type of centroid of the cluster, e.g., average or median), where t is fixed in advance. Configuration (1) is used only for clarity; we discuss optimizations for single linkage and extensions to higher dimensions in Section 5 (and we report on the evaluation of such extensions in Section 7).

A main consideration when devising our protocol was to improve efficiency via a modular design, where separate parts can be securely achieved via different techniques. By securely splitting the joint state {I, Δ} into {L, R}, B, we can implement all protocol components that involve (distance or metadata) computations over points using Paillier-based AHE, except when computing max (or min), for which we rely on GC. Conveniently, all protocol components required by the setup phase to form the joint state {I, Δ}, namely to construct, shuffle and split {I, Δ} into {L, R}, B, can be securely implemented by relying on homomorphic encryption.

Protocol overview. After choosing configurations, P1, P2 run protocol PHC (Algorithm 1), with inputs their datasets P, Q of n1, n2 points, n = n1 + n2, security parameter λ, and statistical parameter κ. Each party establishes its individual Paillier key-pair, and then the parties exchange their corresponding public keys. Then, the parties run sub-protocols PHC.Setup, PHC.Cluster and PHC.Output, which comprise the three main phases of our protocol, in direct analogy to the three components of f*_HC. The general flow of our protocol is described below, in reference also to Figure 3.

In a setup phase, sub-protocol PHC.Setup processes the n input points, viewed as an input array I, and all pairwise distances among these points, viewed as an n × n cluster distance matrix Δ. Here, I, Δ are only virtual, corresponding to an early joint state of P1, P2 that is actually secret-shared between them. Specifically, P1 holds an array L with exactly I's elements but each AHE-encrypted under P2's secret key, and an n × n matrix R with random blinding terms, whereas P2 holds the matrix B = Δ + R with blinded pairwise cluster distances. Importantly, as f*_HC specifies, the joint state {I, Δ} is split only after I's elements and Δ's rows and columns are randomly shuffled, with P1, P2 not knowing the exact shuffling used.

In a clustering phase, sub-protocol PHC.Cluster virtually runs the ordinary hierarchical clustering algorithm HCAlg on matrix Δ: P1, P2 process their individual states R, B to iteratively merge singletons into target clusters, based on inter-cluster distances in Δ. Each iteration merges two clusters into a new one via three tasks:
• Find pair: First, P1, P2 find the closest-cluster pair (i, j) = argMin(Δ), i < j, to merge, i.e., the indices in Δ of the minimum inter-cluster distance D_ij.
• Update linkages: Then, P1, P2 update Δ to Δ′ = B′ − R′ with the new cluster distances after pair (i, j) is merged into cluster C = C_i ∪ C_j. This entails computing (and splitting via a fresh blinding term) the distance δ(C, C′) between C and each not-merged cluster C′, which equals the largest (resp. smallest) of δ(C_i, C′) and δ(C_j, C′) (by associativity of the max/min operator).
• Record merging: Finally, P1, P2 record in Σ that the new cluster C is formed by merging C_i and C_j.

In an output phase, sub-protocol PHC.Output processes the final state {I, Δ} to compute the merging history and metadata for all safe (target) clusters. As Figure 3 indicates, conceptually the output can be considered to be computed in two phases: during clustering, the indices of merged clusters learned after each clustering round collectively encode information about the dendrogram T* and the sizes of the t target clusters; the output phase solely computes the representative values of these clusters. This view is accurate enough to ease presentation but, as we discuss later, the exact details involve processing of carefully recorded data after each of the n − t cluster-merging rounds executed during the clustering phase.

Figure 3: Overall workflow of our protocol PHC.

We next provide more details on how each component is implemented. We assume points are unambiguously mapped into integers in Z_N and all homomorphic (resp. plaintext) operations are reduced modulo N^2 (resp. N). We consistently denote the AHE-encryption, under pk (resp. pk′), of a plaintext d by [d] (resp. ⟦d⟧) and the AHE-decryption, under any key, of a ciphertext c by ⟨c⟩. Whenever the context is clear, we denote each of the two n × n matrices R, B (maintained by P1, P2) by Σ. Finally, we denote the joint execution by P1, P2 of a GC-based protocol GC, on private inputs I1, I2, to get private outputs O1, O2, by (O1; O2) ← GC(I1; I2). Figure 4 summarizes the notation used in our detailed protocol descriptions.

Setup phase. P1, P2 set up their states in three rounds of interaction, as shown in Algorithm 2, using only homomorphic operations over AHE-encrypted data and contributing equally to the randomized state permutation and splitting. Initially, P2 prepares, encrypts under its own key, and sends to P1 information related to its input set Q, which includes its encrypted points among other helper information H, and their encrypted pairwise distances D (lines 1-4).

Algorithm 2: PHC.Setup: Setup Phase
  1  P2: %Create & send helper info
  2     Compute matrix H: H_{1,i} = ⟦q_i⟧, H_{2,i} = ⟦−2q_i⟧, H_{3,i} = ⟦q_i^2⟧, i ∈ [1 : n2]
  3     Compute matrix D: D_{i,j} = ⟦dist(q_i, q_j)⟧, i, j ∈ [1 : n2]
  4     Send {H, D} to P1
  5  P1: %Blind points, linkages
  6     Compute array S: S_i = s_i, s_i ←$ {0,1}^κ, i ∈ [1 : n]
  7     Compute array L: L_i = ⟦p_i⟧, if i ≤ n1; else L_i = H_{1,i−n1}
  8     Blind L as: L_i := L_i · ⟦S_i⟧
  9     Encrypt S as: S_i := [S_i]
 10     Compute matrix R: R_{i,j} = r_{i,j}, r_{i,j} ←$ {0,1}^κ, i, j ∈ [1 : n]
 11     Compute matrix B: B_{i,j} = ⟦dist(p_i, p_j)⟧, if i, j ≤ n1;
 12        B_{i,j} = D_{i−n1,j−n1}, if n1 < i, j; else for i < j, B_{i,j} = ⟦p_i^2⟧ · H_{2,j}^{p_i} · H_{3,j}
 13     Blind B as: B_{i,j} := B_{i,j} · ⟦R_{i,j}⟧
 14     Encrypt R as: R_{i,j} := [R_{i,j}]
 15     Permute S, L, R and B via a random permutation π1(n)
 16     Send {S, L, R, B} to P2  %Send permuted blinded data
 17  P2: %Blind received data
 18     Compute array S′: S′_i = s′_i, s′_i ←$ {0,1}^κ, i ∈ [1 : n]
 19     Blind L as: L_i := L_i · ⟦S′_i⟧
 20     Blind S as: S_i := S_i · [S′_i]
 21     Compute matrix R′: R′_{i,j} = r′_{i,j}, r′_{i,j} ←$ {0,1}^κ, i, j ∈ [1 : n]
 22     Decrypt and re-blind matrix B: B_{i,j} = ⟨B_{i,j}⟩ + R′_{i,j}
 23     Blind R as: R_{i,j} := R_{i,j} · [R′_{i,j}]
 24     Permute S, L and R via a random permutation π2(n)
 25     Send {S, L, R} to P1  %Send permuted points and blinding terms
 26  P1: %Store permuted points & linkages' blinding terms
 27     Decrypt S as: S_i := ⟨S_i⟩, i ∈ [1 : n]
 28     Unblind L as: L_i := L_i · ⟦S_i⟧^{−1}
 29     Decrypt R as: R_{i,j} := ⟨R_{i,j}⟩, i, j ∈ [1 : n]

Then, P1 is tasked to initialize the states L, R and B. First, the list L of all encrypted (under pk′) points in P ∪ Q is created (by arranging the sets in some fixed ordering and concatenating Q after P), and all points are further blinded by random additive terms in S (lines 5-9). Similarly, the matrix B of encrypted (also under pk′) pairwise distances is computed (using the ordering induced by L to arrange the points), and all distances are blinded by random additive terms in R (lines 10-14). The computation of square Euclidean distances across sets P, Q (line 12, using also elements in H) and the blinding of L and B (lines 8, 13) are all performed in the ciphertext domain via the homomorphic property of AHE encryption. All blinding terms in S and R are then encrypted (each under pk, lines 9, 14) and S, L, R and B are sent to P2, after their elements are shuffled using a random permutation π1 (line 15).

Then, P2 roughly mirrors this by further blinding the encrypted points in L and P1's encrypted terms in S by random additive terms in S′ (both in the ciphertext domain, lines 18-20), and also blinding the encrypted distances in B and P1's encrypted terms in R by random additive terms in R′ (the former in the plaintext domain and the latter in the ciphertext domain, lines 21-23). The freshly blinded S, L, R are sent to P1, after their elements are shuffled using a random permutation π2 (line 24). Finally, P1 decrypts the mutually-contributed blinding terms in S and R, and uses the recovered values in S to completely remove these terms from L (in the ciphertext domain, by the properties of AHE encryption, lines 27-29). Due to this, permutation π2 ∘ π1 looks completely random to both parties, while they have securely split the joint state {I, Δ} into {L, R}, B.
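As a toy illustration of the AHE machinery in this setup phase (demo-sized primes and made-up values p, q, blind, chosen only for readability; the actual protocol uses 1024-bit Paillier), the following sketch mimics the cross-party trick of Algorithm 2, line 12: from P2's encryptions of q, −2q and q^2, P1 homomorphically assembles an encryption of dist(p, q) = (p − q)^2 and then blinds it additively, as on line 13.

```python
# Minimal Paillier AHE: Enc(a) * Enc(b) mod N^2 = Enc(a + b), and
# Enc(a)^k mod N^2 = Enc(k * a), which is all the setup phase needs.
import math
import random

PP, QQ = 1009, 1013                    # tiny demo primes (illustration only)
N = PP * QQ
N2, G = N * N, N + 1
LAM = math.lcm(PP - 1, QQ - 1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)
rng = random.Random(7)

def enc(m):                            # [m]: Paillier encryption of m mod N
    r = rng.randrange(1, N)
    return pow(G, m % N, N2) * pow(r, N, N2) % N2

def dec(c):                            # <c>: Paillier decryption
    return (pow(c, LAM, N2) - 1) // N * MU % N

p, q, blind = 17, 5, 4242              # P1's point, P2's point, blinding term
H1, H2, H3 = enc(q), enc(-2 * q), enc(q * q)   # P2's helper info for q
# Enc(p^2) * Enc(-2q)^p * Enc(q^2) = Enc(p^2 - 2pq + q^2) = Enc((p - q)^2)
B_ij = enc(p * p) * pow(H2, p, N2) * H3 % N2
B_ij = B_ij * enc(blind) % N2          # additive blinding, as on line 13
```

Decrypting B_ij recovers (p − q)^2 + blind, i.e., the distance hidden behind a one-time additive term, which is exactly the shape of the shares that the later GC-based comparisons consume.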
Clustering phase. Once P1, P2 have set up their states, they run the hierarchical clustering iterative process (Algorithm 3), operating solely on their individual matrices R, B via two special-purpose GC-based protocols for secure comparison. Importantly, each party encodes cluster information in the diagonal of its matrix state Σ; initially, the i-th entry stores (i, ⊥), denoting the (never-merged but already permuted) singleton of rank i. Hierarchical clustering runs in exactly n − ℓ_t = n − t iterations, or clustering rounds.

Algorithm 3: PHC.Cluster: Clustering Phase
  1  P1, P2:
  2     Initialize merging history: Σ_{i,i} = (i, ⊥), i ∈ [1 : n]
  3     Initialize: ℓ = 1, ℓ_t = t
  4     repeat
  5        Jointly run (i, j; i, j) ← ArgMin(R; B), i < j  %Find pair
  6        foreach k = 1, ..., n, k ≠ i, j do
  7           if Σ_{i,k} ≠ ⊥ and Σ_{j,k} ≠ ⊥ then
  8              P1: X ←$ {0,1}^κ  %Pick new blinding term
  9              P1, P2: Jointly run (⊥; Y) ← MaxDist(R_{i,k}, R_{j,k}, X; B_{i,k}, B_{j,k})  %Find re-blinded max linkage
 10              P1: Set: R_{i,k} = X, R_{k,i} = X  %Update linkages
 11              P2: Set: B_{i,k} = Y, B_{k,i} = Y
 12        Set: Σ_{j,j} := ((Σ_{j,j}, ℓ), i), Σ_{i,i} := ((Σ_{i,i}, Σ_{j,j}, ℓ), ⊥)
 13        Set: Σ_{k,j} = ⊥, Σ_{j,k} = ⊥, k ∈ [1 : n] \ {j}
 14        Set: ℓ := ℓ + 1  %Record merging
 15     until ℓ > n − ℓ_t

First, at the start of each iteration, P1, P2 find which pair of clusters must be merged by jointly running the GC-protocol (i, j; i, j) ← ArgMin(R; B) (line 5): the parties contribute their individual matrices R, B of blinding terms and blinded linkages, to learn the indices (i, j) of the minimum value B_{i,j} − R_{i,j}, with i < j by convention (since R, B are symmetric matrices). The garbled circuit for ArgMin first removes the blinding terms by computing D = B − R, compares all values in D to find the minimum element D_{i,j} = min_{x,y} D_{x,y}, and returns to both parties the indices i, j.

Next, once pair (i, j) is known to P1, P2, they proceed to jointly update the linkages (lines 6-11). For each cluster k in Σ, they change its linkage to the newly merged cluster to be the maximum between its linkages to clusters i, j, by jointly running the GC-protocol (⊥; Y) ← MaxDist(R_{i,k}, R_{j,k}, X; B_{i,k}, B_{j,k}) (line 9): the parties contribute the two entries from their individual matrices R, B that are needed for comparing the linkages B_{i,k} − R_{i,k}, B_{j,k} − R_{j,k} between cluster k and clusters i, j, and P2 learns the maximum of the two values, blinded by the random blinding term X input by P1. The garbled circuit for MaxDist simply returns to (only) P2 the value max{B_{i,k} − R_{i,k}, B_{j,k} − R_{j,k}} + X. Finally, at the end of iteration ℓ, P1, P2 record information about the merging of clusters i, j, i < j (lines 12-14): by convention, the new cluster is stored at location i, by adding the rank ℓ and the information stored at location j (updated with a pointer to i), and by deleting all distances related to cluster j. Overall, the full merging history is recorded. In Appendix B, we provide details on our implementation of the GC-protocols ArgMin and MaxDist, also used in [11, 16, 65, 120].

Output phase. Once clustering is over, P1, P2 compute in two rounds of interaction (Algorithm 4) the common output, consisting of the merging history and the representatives and sizes of the target clusters, using homomorphic operations over encrypted data.

Algorithm 4: PHC.Output: Output Phase
  1  P1: %Compute encrypted point averages
  2     Initialize arrays E, J: E_i = J_i = ⊥, i ∈ [1 : n]
  3     foreach i = 1, ..., n do
  4        if R_{i,i} encodes a target cluster C_i then
  5           Find the index set I_i of points in cluster C_i
  6           Set E_i = ∏_{j∈I_i} L_j, J_i = |I_i|
  7     Send {E, J} to P2
  8  P2: %Compute point averages
  9     Decrypt E as: E_i := ⟨E_i⟩, i ∈ [1 : n]
 10     Send E to P1
 11  P1, P2: %Return output
 12     Output {Σ_{i,i}, E_i / |J_i|, |J_i|}, i ∈ [1 : n]

First, P1 computes encrypted point averages over all target clusters, by exploiting the homomorphic properties of AHE (lines 1-7): using the diagonal of matrix R, P1 first identifies each (of t total) target cluster C_i and then finds the index set I_i (over the permuted input points π2 ∘ π1(P ∪ Q)) of the points contained in C_i, to finally compute ∏_{j∈I_i} L_j = ⟦∑_{j∈I_i} p_j⟧. The resulting t encrypted values and cluster sizes are sent to P2, who returns to P1 the t plaintext sums ∑_{j∈I_i} p_j = ∑_{p_j∈C_i} p_j for each C_i (lines 8-10). At this point, both parties can form the common output (line 12).

5 PROTOCOL ANALYSIS
Efficiency. Asymptotically, our protocol achieves optimal performance, as it incurs no overhead beyond the performance costs associated with running HC (ignoring the dependency on the security parameter λ). The asymptotic overheads incurred on P1, P2 during execution of each phase of PHC are as follows. In the setup phase, the cost overhead for each party is O(n^2), primarily related to the cryptographic operations needed to populate its individual state Σ. In the clustering phase, each of the n − ℓ_t = O(n) total iterations incurs costs proportional to the complexity of running the GC-based protocols ArgMin, MaxDist, where the cost of garbling and evaluating a circuit C, with a total number of wires |C|, is O(|C|). Thus, during the ℓ-th iteration: evaluating circuit ArgMin entails n^2 − 2ℓ comparisons of distance values and subtractions of κ-bit blinding terms, for a total size of O(κ(n^2 − ℓ)); likewise, evaluating circuit MaxDist entails a constant number of comparisons of κ-bit values, and O(n) such circuits are evaluated at iteration ℓ; thus, the total cost during this phase is O(κn^3) for each party. In the output phase, the cost is O(ℓ_t) = O(n) for each party. Thus, the total running time for both parties is O(κn^3). Communication consists of O(n^2) ciphertexts during setup (encrypted distances), O(κn^2) during each clustering round (for the garbled circuits' truth tables), and O(n^2) ciphertexts during the output phase.

Optimized single-linkage protocol OPT. As described, our protocol exploits the associativity of operator max to update the complete linkage between a newly formed cluster C and other clusters C′, as the max of the linkages between C's constituent clusters and C′, securely realized via GC-protocol (·; ·) ← MaxDist(·; ·). Single linkages can be supported readily by updating inter-cluster distances between C and C′ as the min of the distances between C's constituent clusters and C′: line 9 in Algorithm 3 now has P1, P2 jointly run GC-protocol (⊥; Y) ← MinDist(R_{i,k}, R_{j,k}, X; B_{i,k}, B_{j,k}) (see Appendix B) to split the new distance Δ_{i,k} = min{B_{i,k} − R_{i,k}, B_{j,k} − R_{j,k}} into X, Y = Δ_{i,k} + X, without asymptotic efficiency changes.

More generally, the skeleton of protocol PHC allows for extensions that support a wider class of linkage functions, such as average or centroid linkage, by appropriately refining the GC-protocols ArgMin and MinDist, but still at quadratic cost per merged cluster and cubic total cost. Yet, our single-linkage protocol can be optimized to process each new cluster in only O(κn) time, for a reduced O(κn^2) total running time, with GC-protocol (j; j) ← ArgMin(X; Y) now refined, on input arrays X, Y, to return as common output the minimum-value index j of Y − X, excluding any non-linkage values.

The main idea is to exploit the associativity of operator min and the fact that single-linkage clustering only relates to minimum inter-cluster distances, to find the closest pair (i, j) in linear time, by looking up an array Δ̄ = B̄ − R̄ storing the minimum row-wise distances of Δ = B − R (a known technique in information retrieval [73, Section 17.2.1]). Our modified protocol takes only O(κn) comparisons per clustering round, as opposed to O(κn^2) of our main protocol. As shown in Section 7, this results in significant performance improvement.

Specifically, at the end of the setup phase, P1, P2 now also jointly run (j_i; j_i) ← ArgMin(R_i; B_i), i ∈ [1 : n], to learn the minimum-linkage index j_i of the i-th row B_i − R_i of Δ (excluding its i-th location, as B_{i,i}, R_{i,i} store cluster i), and they both initialize array J̄ as J̄_i = j_i, whereas P1 initializes array R̄ as R̄_i = R_{i,j_i} and P2 array B̄ as B̄_i = B_{i,j_i}. Then, at the start of each iteration in the clustering phase (line 5 in Algorithm 3) and assuming that ⊥ = +∞, P1, P2 now jointly run (i; i) ← ArgMin(R̄; B̄) to find the closest-cluster pair (i, j), j = J̄_i, in only O(κn) time. Conveniently, as soon as they update linkages Δ_{i,k} = Δ_{k,i}, for some k ≠ i, j (as Y − X with (⊥; Y) ← MinDist(R_{i,k}, R_{j,k}, X; B_{i,k}, B_{j,k})), P1, P2 also update the joint state {B̄ − R̄, J̄} for each updated row m ∈ {i, k}: first, by jointly running (z; z) ← ArgMin(R̂; B̂) for arrays R̂ = [X, R̄_m], B̂ = [Y, B̄_m] of size 2, and then, if z = 1, by setting R̄_m = R̂_z, B̄_m = B̂_z and J̄_m = {i, k} \ {m}. At the end of each iteration, they also set R̄_j = B̄_j = J̄_j = ⊥, as needed for consistency.

Protocol extensions. Our protocol can be easily adapted to handle higher dimensions (d > 1). Its sub-protocol PHC.Cluster compares squared Euclidean distances and is thus almost unaffected by the number of dimensions; only the setup and output phases need to be modified, as follows. P2 computes helper information H, representing each point not by 3 but by 3d encryptions (i.e., line 2 of Algorithm 2 runs independently for each dimension). Analogously, P1, P2 compute square Euclidean distances (lines 3 and 11-12) as the sum of squared per-dimension differences across all dimensions (over AHE). Shuffling remains largely unaffected, besides lists L, S, S′ consisting of dn encryptions each. Finally, representatives (line 6 in Algorithm 4) are now computed over vectors of d values.

Our protocol can also be extended to other distance metrics, e.g., L1, L2 or Euclidean, and any Lp distance for p ≥ 1, with modifications for computing the distance matrix during setup. With squared Euclidean, the distances are securely computed with AHE; for other metrics, more elaborate sub-protocols may be required.

The CURE approximate clustering algorithm
Input: D, n, s, p, q, t1, t2, R        Output: Clusters C over D
[Sampling] Randomly pick s points in D to form sample S.
[Clustering A]
1. Partition S into p partitions P_i, each of size s/p.
2. Run HCAlg to cluster each P_i into s/(pq) target clusters.
3. Eliminate within each P_i clusters of size less than t_1.
[Clustering B]
1. Run HCAlg to cluster all remaining A-clusters C_A in S.
2. Eliminate clusters of size less than t_2 to get B-clusters C_B.
3. Set R random points in each B-cluster as its representatives.
[Classification]
1. Assign singletons in D to the B-cluster of closest representative.
Figure 5: The CURE approximate clustering algorithm.

Parameters   Description                       Value
n, s         Sizes of dataset and its sample   ≤ 1M, [10^2 : 10^3]
p, q         # parts, cluster/part control     p = 1, 3, 5, q = 3
t_1, t_2     A-, B-cluster thresholds          3 = t_1 < t_2 = 5
R            Representatives per B-cluster     R = 1, 3, 5, 7, 10
Table 1: CURE clustering parameters and values.

Security. In Appendix C, we prove the following result:

Theorem 5.1. Assuming Paillier's encryption scheme is semantically secure and that ArgMin and MaxDist are securely realized by GC-based protocols, protocol PHC securely realizes functionality f*_HC.
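To make the blinded-share mechanics analyzed above concrete, here is a plaintext simulation (all helper names are hypothetical; in the real protocol the subtraction and comparisons happen inside the ArgMin/MinDist garbled circuits): additive blinding preserves the argmin of Δ = B − R, and a merged linkage is re-split with a fresh term X so that Y − X equals the new single linkage.

```python
import random

def split_state(delta, rng):
    # P1 keeps random blinding terms R; P2 keeps B = Delta + R.
    n = len(delta)
    R = [[rng.randrange(1 << 40) for _ in range(n)] for _ in range(n)]
    B = [[delta[i][j] + R[i][j] for j in range(n)] for i in range(n)]
    return R, B

def argmin_pair(R, B):
    # Ideal output of the ArgMin circuit: indices (i, j), i < j,
    # minimizing the unblinded distance B[i][j] - R[i][j].
    n = len(R)
    return min(
        ((B[i][j] - R[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    )[1:]

rng = random.Random(1)
delta = [[0, 7, 2],
         [7, 0, 5],
         [2, 5, 0]]
R, B = split_state(delta, rng)
i, j = argmin_pair(R, B)            # closest pair, recovered from blinded shares
k = ({0, 1, 2} - {i, j}).pop()      # the single remaining cluster
# Single-linkage update (MinDist): P2 learns only Y = min(...) + X, fresh X.
X = rng.randrange(1 << 40)
Y = min(B[i][k] - R[i][k], B[j][k] - R[j][k]) + X
```

Neither party's individual state (R alone, or B alone) reveals Δ, yet the pair found and the updated share difference Y − X equal exactly what plaintext single-linkage clustering would compute.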
If 푝 1 and 퐴, 퐵 are the A- and B-outliers, then: ′ S C S S\O C cally secure and that ArgMin and MaxDist are securely realized by i. HCAlg first runs on to form 퐴 over 퐴 ≜ 퐴; 퐴 is ∗ exactly the output of HCAlg run on S퐴; and next GC-based protocols, protocol PHC securely realizes functionality 푓퐻퐶 . ′ ii. HCAlg runs on C퐴 to form C퐵 over S퐵 ≜ S\{O퐴 ∪ O퐵 }; C HCAlg S 6 SCALABILITY VIA APPROXIMATION 퐵 is exactly the output of run on 퐵. Fact 1 refines our protocol-design space to only securely realizing The cryptographic machinery of our protocol imposes a notice- the clustering task, where sampling and classification are viewed able overhead in practice. Although it is asymptotically similar as input pre-processing and output post-processing of clustering. to plaintext HC, standard operations are now replaced by crypto- Specifically, P1, P2: (1) individually form random input samples graphic ones—no matter how well-optimized the code, such crypto- S푃 , S푄 of their own datasets 푃, 푄; (2) compute B-clusters and their hardened operations will ultimately be slower. Hence, to scale to representatives (as specified by CURE); and (3) use these B-cluster larger datasets, we seek to exploit approximate schemes for hierar- representatives to individually classify their own unlabeled points. chical clustering. In our case, approximation refers to performing As such, the default private realization of CURE would entail clustering over a high-volume dataset by applying the HCAlg al- having the parties perform clustering A and B jointly. Yet, since our gorithm only on a small subset of the dataset. The effect of this design space is already restricted to provide approximate solutions, is twofold: Cluster analysis is much faster but using fewer points we also consider two protocol variants, where parties trade even lowers accuracy and increases sensitivity to outliers. 
more accuracy for efficiency, by performing: (1) clustering A locally In what follows, we adapt the CURE approximate clustering al- and only B jointly; and, in the extreme case (2) clustering A and B gorithm [47] and seamlessly integrate it to our main protocol PHC, locally. We denote these protocols by PCure2, PCure1 and PCure0. within a flexible design framework that offers a variety of configu- In PCure0, P1, P2 non-collaboratively compute B-clusters of their rations for balancing tradeoffs between performance and accuracy, samples and announce the representatives selected. Though a de- to overall get the first variants of CURE for private collaborative hi- generate solution, as it involves no interaction, this consideration erarchical clustering. Although, in principle, our framework can be is still useful: First, to serve as a baseline for evaluating the other applied to any approximate clustering scheme (e.g., BIRCH [117]), variants, but mostly to further refine our design space. PCure0 (triv- we choose CURE for its strong resilience to outliers and high accu- ially) preserves privacy during B-cluster computation, but violates racy (even on samples less than 1% of original data)—features that the privacy guarantees offered by our point-agnostic dendrograms, place it among the best options for scalable hierarchical clustering. by revealing a subset of a party’s input points to the other party. Described in Figure 5, on input the original dataset D of size 푛 To rectify this, present also in PCure2 and PCure1, we fix 푅 = 1 and a number of approximation parameters, CURE first randomly and have each B-cluster be represented by its centroid. Using aver- samples 푠 data points from D to form sample set S. During A- age values is expected to have no impact on accuracy, at least for clustering, S is partitioned into 푝 equally-sized parts P1, P2,... 
P푝, spherical clusters (in [47], 푅 > 1 is only used to improve accuracy and the ordinary algorithm HCAlg runs 푝 times to form a set C퐴 of of non-spherical clusters). Fact 2 then ensures that B-clusters (and A-clusters: Its 푖th execution is on input P푖 , 푖 ∈ [1, 푝], until exactly their centroids) can be computed by essentially running algorithm 푠/(푝푞) clusters are formed, of which only those of size at least 푡1 HCAlg, possibly with slight modifications (discussed below). C are included in 퐴 and the rest are eliminated as outliers. During In PCure1, P1, P2 non-collaboratively compute A-clusters of B-clustering, HCAlg runs once again, this time over set C퐴, to form their samples and then jointly merge them to B-clusters. Semanti- a set C퐵 of B-clusters, from which clusters of size less than 푡2 > 푡1 cally, they run HCAlg, not starting at level 푛 (singletons) but at an are eventually eliminated as outliers. Finally, for each B-cluster in intermediate level 푖, where each input A-cluster contains at least C 퐵 a number of 푅 random representatives are selected, and each 푡1 points. Our PHC can be employed, with one modification: At singleton point in D is included to the B-cluster containing its setup, the parties’ joint state encodes their individual A-clusters closest representative. Table 1 summarizes suggested values for and their pairwise linkages. Accordingly, sub-protocol PHC.Setup each parameter as per CURE’s original description [47]. is modified: (1) Lines 3 and 11 now compute inter-cluster distances (of same-party pairs), and (2) lines 2 and 12 are used as a subroutine to compute all point distances across a given A-cluster pair, over which inter-cluster linkages (of cross-party pairs) are evaluated with ArgMin. The running time of modified PHC.Setup is 푂 (휆푠2), as 푂 (푠2) distances are computed across 푂 (푠) A-cluster points. In PCure2, P1, P2 jointly compute A- and B-clusters. This in- troduces the challenge of how to transition from A to B. 
Simply running 푝 copies of HCAlg in parallel for A-clusters does not pro- (a) Computation cost of PHC. (b) Computation cost of OPT. vide the cluster linkages that are necessary for HCAlg to compute B-clusters. Possible solutions are either to treat A-clusters as single- tons, which can drastically impair accuracy, or running an interme- diate MPC protocol to bootstrap HCAlg with cluster linkages, which can impair performance. Instead, we simply fix 푝 = 1, seamlessly using the final joint state of clustering A as initial state for cluster- ing B. Missing speedups by parallelism is compensated by avoiding a costly bootstrap-protocol, at no accuracy loss, as our experiments confirm푝 ( > 1, is only suggested for parallelism in [47]). (c) Real-data performance of PHC. (d) Real-data performance of OPT. Finally, the security of protocols PCure1 and PCure2 can be re- Figure 6: Performance of PHC (left) Vs. OPT (right). duced to that of PHC. Our modular design and facts 2 and 3, ensure that security in our private CURE-approximate clustering is cap- We generate synthetic 푑-dimensional datasets of sizes up to 1M ∗ ∈ [ ] tured by our ideal functionality 푓퐻퐶 of Section 3: The intended records and 푑 1, 20 , using a Gaussian mixture distribution, as two-party computation merely involves computing B-cluster repre- follows: (1) The number of clusters is randomly chosen in [8 : 15]; ∗ 푑 sentatives, which 푓퐻퐶 provides, and any input/output modification (2) Each cluster center is randomly chosen in [−50, 50] (perfor- in HCAlg causes a trivial change to the pre-/post-processing com- mance is dominated by 휅 but not exact data values), subject to a ∗ ponent of 푓퐻퐶 , consistent to our point-agnostic dendrograms. 
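A compact plaintext sketch of the CURE pipeline of Figure 5 for the privacy-motivated choice p = 1 and R = 1 (helper names hc and cure_sketch are hypothetical; single linkage on 1-D points, and clustering B is simplified to a re-run on the retained points, loosely invoking fact 3 rather than clustering the A-clusters as units):

```python
import random

def hc(points, k):
    # Plain single-linkage agglomerative clustering down to k clusters
    # (stand-in for HCAlg; 1-D points, illustration only).
    clusters = [[x] for x in points]
    while len(clusters) > k:
        _, i, j = min(
            (min(abs(a - b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        )
        clusters[i] += clusters.pop(j)
    return clusters

def cure_sketch(D, s, q, t1, t2, kb=2, seed=0):
    rng = random.Random(seed)
    S = rng.sample(D, s)                                   # [Sampling]
    A = [c for c in hc(S, s // q) if len(c) >= t1]         # [Clustering A], p = 1
    retained = [x for c in A for x in c]
    B = [c for c in hc(retained, kb) if len(c) >= t2]      # [Clustering B]
    reps = [sum(c) / len(c) for c in B]                    # R = 1: centroid rep
    # [Classification]: assign every point to its closest representative
    labels = [min(range(len(reps)), key=lambda r: abs(x - reps[r])) for x in D]
    return reps, labels
```

In the private variants, only the B-cluster centroids (the reps list) cross the trust boundary; sampling and the final classification step remain local to each party.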
minimum-separation distance between pairs; (3) Cluster standard deviation is randomly chosen in [0.5, 4]; and (4) Outliers are selected uniformly at random in the same interval and assigned randomly to 7 EXPERIMENTAL EVALUATION clusters to emulate 3 noise percentage scenarios: low 0.1%, medium Our main goal is to evaluate the computational cost of our protocols 1%, and high 5%. We randomly split each dataset into two halves and to determine the improvement of the optimized and approxi- which form the private inputs of the parties. We set the number of mate variants. We use four datasets from the UCI ML Repository [2], target clusters to ℓ푡 = 5; as our protocol incurs costs linear in the restricted to numeric attributes: (1) Iris for iris plants classification number of iterations (푛 − ℓ푡 ), this choice comprises a worst-case (150 records, 4 attributes); (2) Wine for chemical analysis of wines setting, as in practice more than 5 target clusters are desired. (178 records, 13 attributes); (3) Heart for heart disease diagnosis We adapted our protocols to support floating point numbers. (303 records, 20 attributes); and (4) Cancer for breast cancer diag- Here, due to the simplicity of the involved operations, we can nostics (569 records, 30 attributes). As these are relatively small, we rely on fixed-precision floating point numbers and it suffices to also generate our own synthetic datasets, scaling the size to millions multiply floating point values by a constant 퐾 (e.g. 퐾 = 220 for IEEE of samples. Note that our protocol’s performance depends mainly 754 doubles). During PHC.Setup, we can achieve higher precision. on the dataset size, is invariant to actual data values, and varies After each party decrypts the blinded values (line 22 and line 29), very little with data dimensionality, as our experiments confirm. they can re-scale by dividing the constant 퐾 without affecting We introduced several variants of approximate clustering based precision. 
During Cluster, as we only merge the points based on on CURE and want to evaluate their accuracy and determine pos- the comparisons between the distances, multiplying by a constant sible between performance-accuracy tradeoffs. Traditionally, hi- does not affect the results. erarchical clustering is an unsupervised learning task, for which Finally, we use the ABY C++ framework [28], 128-bit AES for accuracy metrics are not well defined. However, it is common to GC, 1024-bits Pailler, and set 휅 = 40. We use libpaillier [3] for evaluate the accuracy of clustering via ground truth datasets in- Paillier encryption. We run our experiments on a 24-core machine, cluding class labels on samples. A good clustering algorithm will running Scientific Linux with 128GB memory on 2.9GHz Intel Xeon. generate “pure clusters” and separate data according to the ground Protocol PHC. We first report results on the performance of our truth. Each cluster will be labeled with the majority class of its PHC protocol from Section 4. Figure 6a shows the computational samples, and the accuracy of the protocol is defined as the fraction cost for synthetic datasets of various sizes and dimensions, aver- of input points that are clustered into their correct class relative aged over single and complete linkages. First, consistently with our to the ground truth. We employ this measure of accuracy to evalu- analysis in Section 5, dimensions have minimal impact, since PHC’s ate approximate clustering variants (PCure0, PCure1, and PCure2). performance relates primarily to computing inter-cluster distances Our standard privacy-preserving clustering protocol PHC and the that is minimally affected by 푑. As expected its cubic asymptotic optimized version OPT maintain the same accuracy as the original complexity, the overhead increases steeply with dataset size 푛. non-private protocol, hence we do not report accuracy for them. Figure 8: End-to-end computation of PCure1, PCure2. 
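The fixed-point encoding described above (scaling by a public constant 퐾 before entering the integer-only cryptographic operations, rescaling after decryption) can be sketched as follows; the function names are illustrative and not taken from our implementation:

```python
# Sketch of the fixed-point encoding used for floating-point support:
# values are scaled by a public constant K before entering the
# (integer-only) protocol, and rescaled after decryption.

K = 2 ** 20  # scaling constant, e.g. K = 2^20 as in the text

def encode(x: float) -> int:
    """Map a float to the integer domain of the protocol."""
    return round(x * K)

def decode(v: int) -> float:
    """Rescale after decryption; error is bounded by 1/(2K)."""
    return v / K

# Comparisons (the only operations needed during Cluster) are preserved:
d1, d2 = 3.14159, 2.71828
assert (encode(d1) < encode(d2)) == (d1 < d2)

# Rescaling after removing an additive blinding term recovers the value:
blind = encode(123.5)            # additive blinding term
masked = encode(d1) + blind      # what a party would decrypt
assert abs(decode(masked - blind) - d1) < 1 / K
```

Since merges depend only on distance comparisons, the common factor 퐾 cancels out and never affects the clustering output, only the precision of reported distances.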
Figure 7: Accuracy of CURE, PCure∗: 푝 = 1 (left), 푝 = 5 (right); #outliers = 0.1% (top), 1% (middle), 5% (bottom).

Protocol OPT. Figure 6b shows the computational costs on synthetic datasets for our optimized single-linkage variant OPT (with configurations identical to those for PHC). In line with our analysis in Section 5, OPT significantly improves performance, reducing running time by an order of magnitude. E.g., for datasets of 2000 20-dimensional points, the running time is approximately 230 secs, an 8× speedup compared to PHC. The difference in our above example is explained by the following observations: (1) although OPT improves performance during clustering by a linear factor, it adds costs during setup; and (2) the involved constants of the quadratic costs are higher for running time in the setup phase, and vice versa in the clustering phase. As shown in Figure 6d, OPT significantly improves performance over PHC, also when tested over our real datasets.

Protocols PCure∗. Figure 7 shows the accuracy of our CURE-variant protocols PCure0, PCure1, PCure2 from Section 6, and the non-private CURE algorithm, on synthetic datasets of 1M records for 10^2–10^3 samples, partition parameters 푝 = 1 and 푝 = 5, and for low (0.1%), medium (1%), and high (5%) outliers-to-data percentages. Clearly, PCure0, where parties run CURE on their own samples, without any interaction besides announcing representatives for individually computed clusters, exhibits very poor accuracy, e.g., a 44.4% loss for 1M records. For 푝 = 1, PCure1 and PCure2 achieve similar accuracy, which approaches that of CURE for large enough samples: at 300 samples or higher, the gap is within 3%. For higher values of 푝, e.g., 푝 = 5, PCure1 and PCure2 exhibit a difference in accuracy: e.g., at 200 samples the accuracy of PCure1 is lower by 39.54% than that of PCure2, but at 500 samples or more they are within 3.18%.

Moreover, experimenting with all combinations of 푝 = 1, 3, 5 partitions and 푅 = 1, 3, 5, 7, 10 representatives shows that the accuracies of PCure2 and PCure1 are very close to CURE at 푠 = 1000 samples (or more). The largest observed difference between PCure1 and CURE is 3.57%, and between PCure2 and CURE is 2.7%. For 푝 = 1 and 푅 = 1, either difference is less than 1% at 1000 samples (or more). Thus, our choice of 푝 = 1 and 푅 = 1 to protect data privacy, as argued in Section 6, does not impact the protocol's accuracy.

We also compare end-to-end computation for PCure1 and PCure2 (with OPT), with 푛 = 10^6, 푑 = 10, 1% outliers, no sample partitioning (푝 = 1), ℓ푡 = 5 target clusters, and 푞 = 3 for A-Clustering. Figure 8 shows their good performance for sample sizes 푠 ∈ [400 : 1000]. For 10^3 samples, PCure2 runs in 104 sec, while PCure1 runs in 35 sec, about 3× faster, with similar accuracy (97.09%).

Network Latency Impact. Although our experiments show the efficiency of our schemes, if executed over a WAN they would be affected by network latency and data transmission. To estimate this impact, we considered two AWS machines in us-east and us-west and measured their latency at 50–60ms. Taking PCure2 with 400 samples and ℓ푡 = 5 as our use case, a single clustering round with four roundtrips (assuming the distance update is done with a single garbled circuit) would take approximately 200–240ms. Regarding data transmission of the two garbled circuits for finding the minimum distance and updating the cluster distances, using the circuits for addition/subtraction, comparison, and min-index-selection from [65] for 100 64-bit values, we estimate their size as roughly 10MB (not including the OT data, which is dominated by the circuits). Under the mild assumption of a 100Mbps connection, transmission would take ∼100ms, for a total of < 350ms per round. In subsequent rounds, the circuits become progressively smaller but the number of roundtrips remains the same; conservatively multiplying by 395 rounds, we have approximately 128 sec of total communication time. For comparison, in Figure 8, for the same setting, computation takes ∼55 sec.

Hence, communication indeed becomes a bottleneck for our schemes when run over a WAN, but not to the point where they are entirely impractical. Furthermore, our goal when implementing our schemes was not to minimize end-to-end latency but computation, so there is plenty of room for optimizations. E.g., our protocols can be run in "round batches", merging 푘 clusters with one interaction (via larger circuits), which would decrease RTTs by a factor of 푘. Finally, dedicated cloud technologies, such as AWS VPC [4], can offer private connections, drastically reducing communication time.

8 RELATED WORK
Secure machine learning. There exists a rapidly growing line of works that propose secure protocols for a variety of ML tasks. This includes constructions for private classification models in the supervised learning setting (such as decision trees [70], SVM classification [108], linear regression [31, 32, 94], logistic regression [38], and neural networks [12, 85, 93]), as well as federated learning tasks [15]. Another focus has been on proposing MPC-based protocols that are provably secure under a well-defined real/ideal definition, similar to ours (e.g., [9, 16, 20, 22, 40, 42, 43, 51, 61, 67, 72, 77, 81, 90, 92]), for numerous tasks, with a focus on neural networks. The above works can be split into two categories: those that focus on private model training and those that focus on private inference/classification. In our unsupervised setting, our protocol protects the privacy of the parties' data during the clustering phase.

Leakage in machine learning. The significant impact of information leakage in collaborative, distributed, or federated learning has been the topic of a long line of research (e.g., see [6, 66, 71, 97]). Various practical attacks have been demonstrated that infer information about the training data or the ML model and its hyper-parameters (e.g., [39, 54, 98]). This is even more important in collaborative learning, where parties could otherwise benefit from sharing data but such leakage may stop them (e.g., [74, 112, 118, 119]). Hence, it is crucial for our protocol to formally characterize what information is shared between the two parties.

Deployed techniques. In terms of techniques, most works use (some variant of) homomorphic encryption (e.g., [41, 86]). More advanced ML tasks often require hybrid techniques, e.g., combining the above with garbled circuits (e.g., [92]) or other MPC techniques [75, 90]. Our construction adopts such "mixed" techniques for the problem of hierarchical clustering. More recently, solutions have been proposed based on trusted hardware (such as Intel SGX), e.g., [21, 82, 105]. This avoids the need for "heavy" cryptography; however, it weakens the threat model, as it requires trusting the hardware vendor. Finally, a different approach seeks privacy via data perturbation [5, 23, 24, 83, 97, 99], by adding statistical noise to hide data values, e.g., differential privacy [33]. Such techniques are orthogonal to the cryptographic methods that we apply here, but they can potentially be combined (e.g., as in [87]). Using noise to hide whether a specific point has been included in a given cluster would complement very nicely our cluster-information-reduction approach, potentially leading to a more robust security treatment.

Privacy-preserving clustering. Many previous works proposed private solutions for different clustering tasks, with the majority focusing on the popular, but conceptually simpler, 푘-means problem (e.g., [18, 30, 35, 58–60, 64, 76, 89, 107]) and other partitioning-based clustering methods (e.g., [62, 116]). Fewer works consider private density-based [17, 25, 115] or distribution-based [48] clustering. An in-depth literature survey and comparative analysis of private clustering schemes can be found in the recent work of [49]. Focusing on private hierarchical clustering, no previous work offers a formal security definition, relying instead on ad-hoc analysis [27, 55–57, 96]. Moreover, some schemes leak information to the participants that can clearly be harmful, and is much more than what our protocol reveals; e.g., [95, 106] reveal all distances across parties' records. One notable exception is the scheme of Su et al. [102] which, however, is designed specifically for the case of document clustering. Here, we proposed a security formulation within the widely studied real/ideal paradigm of MPC that characterizes precisely what information is revealed to the collaborating parties. Besides making it easier to compare our solution with potential future ones that follow our formulation, this is, to the best of our knowledge, the only private hierarchical clustering scheme with formal proofs of security. Finally, it is an interesting problem to combine optimizations for "plaintext" clustering (e.g., [26, 78, 84]) with privacy-preserving techniques to improve efficiency.

Secure approximate computation. The interplay between cryptography and efficient approximation [37] has already been studied for pattern matching in genomic data [10, 109], 푘-means [101], and logistic regression [103, 111]. To the best of our knowledge, ours is the first work to compose secure cryptographic protocols with efficient approximation algorithms for hierarchical clustering.

9 CONCLUSION
We propose the first formal security definition for private hierarchical clustering and design a protocol for single and complete linkage, as well as an optimized version. We also combine this with approximate clustering to increase scalability. We hope this work motivates further research in privacy-preserving unsupervised learning, including secure protocols for other linkage types (e.g., Ward), alternative approximation frameworks (e.g., BIRCH [117]), different tasks (e.g., mixture models, association rules, or graph learning), or schemes for more than two parties to benefit from larger-scale collaborations. Specific to our definition of privacy, we believe it would be helpful to experimentally and empirically evaluate the impact (even our significantly redacted) dendrogram leakage can have, e.g., by demonstrating possible leakage-abuse attacks.

ACKNOWLEDGMENTS
The authors would like to thank the members of the AWS Crypto team for their useful comments and inputs, the anonymous reviewers for their valuable feedback, and Anrin Chakraborti for shepherding this paper. Dimitrios Papadopoulos was supported by the Hong Kong Research Grants Council (RGC) under grant ECS-26208318. Alina Oprea and Nikos Triandopoulos were supported by the National Science Foundation (NSF) under grants CNS-171763 and CNS-1718782.

REFERENCES
[1] 2017. The Intelligent Trial: AI Comes To Clinical Trials. Clinical Informatics News. http://www.clinicalinformaticsnews.com/2017/09/29/the-intelligent-trial-ai-comes-to-clinical-trials.aspx.
[2] 2019. The UCI Machine Learning Data Repository. http://archive.ics.uci.edu/ml/index.php.
[3] 2019. UTexas Paillier Library. http://acsc.cs.utexas.edu/libpaillier.
[4] 2021. AWS VPC. https://aws.amazon.com/vpc.
[5] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In ACM SIGSAC CCS 2016. 308–318.
[6] Mohammad Al-Rubaie and J. Morris Chang. 2019. Privacy-Preserving Machine Learning: Threats and Solutions. IEEE Secur. Priv. 17, 2 (2019), 49–58. https://doi.org/10.1109/MSEC.2018.2888775
[7] AlienVault. 2020. Open Threat Exchange. Available at https://otx.alienvault.com/.
[8] Cyber Threat Alliance. 2020. Available at http://cyberthreatalliance.org/.
[9] Yoshinori Aono, Takuya Hayashi, Le Trieu Phong, and Lihua Wang. 2016. Scalable and Secure Logistic Regression via Homomorphic Encryption. In ACM CODASPY 2016. 142–144.
[10] Gilad Asharov, Shai Halevi, Yehuda Lindell, and Tal Rabin. 2018. Privacy-Preserving Search of Similar Patients in Genomic Data. PoPETs 2018, 4 (2018), 104–124. https://doi.org/10.1515/popets-2018-0034
[11] Foteini Baldimtsi, Dimitrios Papadopoulos, Stavros Papadopoulos, Alessandra Scafuro, and Nikos Triandopoulos. 2017. Server-Aided Secure Computation with Off-line Parties. In ESORICS 2017. 103–123.
[12] Mauro Barni, Pierluigi Failla, Riccardo Lazzeretti, Ahmad-Reza Sadeghi, and Thomas Schneider. 2011. Privacy-Preserving ECG Classification With Branching Programs and Neural Networks. IEEE Trans. Information Forensics and Security 6, 2 (2011), 452–468. https://doi.org/10.1109/TIFS.2011.2108650
[13] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. 2009. Scalable, Behavior-Based Malware Clustering. In Proceedings of the 16th Symposium on Network and Distributed System Security (NDSS).
[14] Mihir Bellare, Viet Tung Hoang, and Phillip Rogaway. 2012. Foundations of garbled circuits. In ACM CCS 2012. 784–796. https://doi.org/10.1145/2382196.2382279
[15] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In ACM CCS 2017. 1175–1191. https://doi.org/10.1145/3133956.3133982
[16] Raphael Bost, Raluca Ada Popa, Stephen Tu, and Shafi Goldwasser. 2015. Machine Learning Classification over Encrypted Data. In NDSS 2015.
[17] Beyza Bozdemir, Sébastien Canard, Orhan Ermis, Helen Möllering, Melek Önen, and Thomas Schneider. 2021. Privacy-preserving Density-based Clustering. In ASIA CCS '21: ACM Asia Conference on Computer and Communications Security, Virtual Event, Hong Kong, June 7-11, 2021. ACM, 658–671. https://doi.org/10.1145/3433210.3453104
[18] Paul Bunn and Rafail Ostrovsky. 2007. Secure two-party k-means clustering. In Proceedings of the 2007 ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, USA, October 28-31, 2007. 486–497. https://doi.org/10.1145/1315245.1315306
[19] Qiang Cao, Xiaowei Yang, Jieqi Yu, and Christopher Palow. 2014. Uncovering Large Groups of Active Malicious Accounts in Online Social Networks. In Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS).
[20] Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, Constance Morel, and Emmanuel Prouff. 2017. Privacy-Preserving Classification on Deep Neural Network. Cryptology ePrint Archive, Report 2017/035.
[21] Javad Ghareh Chamani and Dimitrios Papadopoulos. 2020. Mitigating Leakage in Federated Learning with Trusted Hardware. CoRR abs/2011.04948 (2020). arXiv:2011.04948 https://arxiv.org/abs/2011.04948
[22] Nishanth Chandran, Divya Gupta, Aseem Rastogi, Rahul Sharma, and Shardul Tripathi. [n.d.]. EzPC: Programmable and Efficient Secure Two-Party Computation for Machine Learning. In IEEE European Symposium on Security and Privacy, EuroS&P 2019. 496–511. https://doi.org/10.1109/EuroSP.2019.00043
[23] Melissa Chase, Ran Gilad-Bachrach, Kim Laine, Kristin E. Lauter, and Peter Rindal. 2017. Private Collaborative Neural Network Learning. IACR Cryptology ePrint Archive 2017 (2017), 762. http://eprint.iacr.org/2017/762
[24] Kamalika Chaudhuri and Claire Monteleoni. 2008. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems 21, 2008. 289–296.
[25] Jung Hee Cheon, Duhyeong Kim, and Jai Hyun Park. 2019. Towards a Practical Cluster Analysis over Encrypted Data. In Selected Areas in Cryptography - SAC 2019 - 26th International Conference, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 11959). Springer, 227–249. https://doi.org/10.1007/978-3-030-38471-5_10
[26] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. [n.d.]. Hierarchical Clustering: Objective Functions and Algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, Artur Czumaj (Ed.). 378–397. https://doi.org/10.1137/1.9781611975031.26
[27] Ipsa De and Animesh Tripathy. 2014. A Secure Two Party Hierarchical Clustering Approach for Vertically Partitioned Data Set with Accuracy Measure. In Recent Advances in Intelligent Informatics. Springer International Publishing, 153–162.
[28] D. Demmler, T. Schneider, and M. Zohner. 2015. ABY - A framework for efficient mixed-protocol secure two-party computation. In Proc. 22nd Annual Network and Distributed System Security Symposium (NDSS).
[29] Ben Dickson. 2016. How threat intelligence sharing can help deal with cybersecurity challenges. Available at https://techcrunch.com/2016/05/15/how-threat-intelligence-sharing-can-help-deal-with-cybersecurity-challenges/.
[30] Mahir Can Doganay, Thomas Brochmann Pedersen, Yücel Saygin, Erkay Savas, and Albert Levi. 2008. Distributed privacy preserving k-means clustering with additive secret sharing. In PAIS 2008. 3–11.
[31] Wenliang Du and Mikhail J. Atallah. 2001. Privacy-Preserving Cooperative Scientific Computations. In 14th IEEE Computer Security Foundations Workshop (CSFW-14 2001). 273–294.
[32] Wenliang Du, Yunghsiang S. Han, and Shigang Chen. 2004. Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification. In Proceedings of the Fourth SIAM International Conference on Data Mining. 222–233.
[33] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC 2006. 265–284.
[34] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95 (1998), 14863–14868. Issue 25.
[35] Zekeriya Erkin, Thijs Veugen, Tomas Toft, and Reginald L. Lagendijk. 2013. Privacy-preserving distributed clustering. EURASIP J. Information Security 2013 (2013), 4. https://doi.org/10.1186/1687-417X-2013-4
[36] Facebook. 2018. Threat Exchange. Available at https://developers.facebook.com/products/threat-exchange.
[37] Joan Feigenbaum, Yuval Ishai, Tal Malkin, Kobbi Nissim, Martin J. Strauss, and Rebecca N. Wright. 2006. Secure multiparty computation of approximations. ACM Trans. Algorithms 2, 3 (2006), 435–472. https://doi.org/10.1145/1159892.1159900
[38] Stephen E. Fienberg, William J. Fulp, Aleksandra B. Slavkovic, and Tracey A. Wrobel. 2006. "Secure" Log-Linear and Logistic Regression Analysis of Distributed Databases. In Privacy in Statistical Databases. 277–290.
[39] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures. In ACM CCS 2015. ACM, New York, NY, USA, 1322–1333. https://doi.org/10.1145/2810103.2813677
[40] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. 2017. Privacy-Preserving Distributed Linear Regression on High-Dimensional Data. PoPETs 2017, 4 (2017), 345–364. https://doi.org/10.1515/popets-2017-0053
[41] Craig Gentry. 2009. A Fully Homomorphic Encryption Scheme. Ph.D. Dissertation. Stanford, CA, USA. Advisor(s) Boneh, Dan. AAI3382729.
[42] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. 2016. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. In Proc. 33rd International Conference on Machine Learning (ICML).
[43] Ran Gilad-Bachrach, Kim Laine, Kristin E. Lauter, Peter Rindal, and Mike Rosulek. [n.d.]. Secure Data Exchange: A Marketplace in the Cloud. In Proceedings of the 2019 ACM SIGSAC Conference on Cloud Computing Security Workshop, CCSW@CCS 2019. 117–128. https://doi.org/10.1145/3338466.3358924
[44] M. Girvan and M. E. J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (11 June 2002), 7821–7826. https://doi.org/10.1073/pnas.122653799
[45] Oded Goldreich, Silvio Micali, and Avi Wigderson. 1987. How to Play any Mental Game or A Completeness Theorem for Protocols with Honest Majority. In ACM STOC 1987. 218–229. https://doi.org/10.1145/28395.28420
[46] Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. 2008. BotMiner: Clustering Analysis of Network Traffic for Protocol and Structure-independent Botnet Detection. In Proceedings of the 17th USENIX Security Symposium.
[47] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. 2001. Cure: An Efficient Clustering Algorithm for Large Databases. Inf. Syst. 26, 1 (2001), 35–58. https://doi.org/10.1016/S0306-4379(01)00008-4
[48] Mona Hamidi, Mina Sheikhalishahi, and Fabio Martinelli. 2018. Privacy Preserving Expectation Maximization (EM) Clustering Construction. In DCAI 2018 (Advances in Intelligent Systems and Computing, Vol. 800). Springer, 255–263. https://doi.org/10.1007/978-3-319-94649-8_31
[49] Aditya Hegde, Helen Möllering, Thomas Schneider, and Hossein Yalame. 2021. SoK: Efficient Privacy-preserving Clustering. Proc. Priv. Enhancing Technol. 2021, 4 (2021), 225–248. https://doi.org/10.2478/popets-2021-0068
[50] W. Henecka, S. Kögl, A.-R. Sadeghi, T. Schneider, and I. Wehrenberg. 2010. TASTY: Tool for automating secure two-party computations. In Proc. ACM Conference on Computer and Communications Security (CCS).
[51] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. [n.d.]. Deep Neural Networks Classification over Encrypted Data. In Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy, CODASPY 2019. 97–108. https://doi.org/10.1145/3292006.3300044
[52] Ehsan Hesamifard, Hassan Takabi, Mehdi Ghasemi, and Catherine Jones. 2017. Privacy-preserving Machine Learning in Cloud. In Proceedings of the 9th Cloud Computing Security Workshop, CCSW@CCS 2017, Dallas, TX, USA, November 3, 2017, Bhavani M. Thuraisingham, Ghassan Karame, and Angelos Stavrou (Eds.). ACM, 39–43.
[53] Ehsan Hesamifard, Hassan Takabi, Mehdi Ghasemi, and Rebecca N. Wright. 2018. Privacy-preserving Machine Learning as a Service. Proc. Priv. Enhancing Technol. 2018, 3 (2018), 123–142.
[54] Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. 2017. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. In ACM CCS 2017. 603–618.
[55] Ali Inan, Selim Volkan Kaya, Yücel Saygin, Erkay Savas, Ayca Azgin Hintoglu, and Albert Levi. 2007. Privacy preserving clustering on horizontally partitioned data. Data Knowl. Eng. 63, 3 (2007), 646–666. https://doi.org/10.1016/j.datak.2007.03.015
[56] Geetha Jagannathan, Krishnan Pillaipakkamnatt, and Rebecca N. Wright. 2006. A New Privacy-Preserving Distributed k-Clustering Algorithm. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20-22, 2006, Bethesda, MD, USA. SIAM, 494–498. https://doi.org/10.1137/1.9781611972764.47
[57] Geetha Jagannathan, Krishnan Pillaipakkamnatt, Rebecca N. Wright, and Daryl Umano. 2010. Communication-Efficient Privacy-Preserving Clustering. Trans. Data Privacy 3, 1 (2010), 1–25. http://www.tdp.cat/issues/abs.a028a09.php
[58] Geetha Jagannathan and Rebecca N. Wright. 2005. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In ACM SIGKDD 2005. 593–599. https://doi.org/10.1145/1081870.1081942
[59] Angela Jäschke and Frederik Armknecht. 2018. Unsupervised Machine Learning on Encrypted Data. In Selected Areas in Cryptography - SAC 2018, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 11349). Springer, 453–478. https://doi.org/10.1007/978-3-030-10970-7_21
[60] Somesh Jha, Luis Kruger, and Patrick McDaniel. 2005. Privacy Preserving Clustering. In Proceedings of the 10th European Symposium on Research in Computer Security (ESORICS).
[61] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. [n.d.]. GAZELLE: A Low Latency Framework for Secure Neural Network Inference. In 27th USENIX Security Symposium, USENIX Security 2018. 1651–1669. https://www.usenix.org/conference/usenixsecurity18/presentation/juvekar
[62] Hannah Keller, Helen Möllering, Thomas Schneider, and Hossein Yalame. 2021. Balancing Quality and Efficiency in Private Clustering with Affinity Propagation. In Proceedings of the 18th International Conference on Security and Cryptography, SECRYPT 2021, July 6-8, 2021. SCITEPRESS, 173–184. https://doi.org/10.5220/0010547801730184
[63] Florian Kerschbaum, Thomas Schneider, and Axel Schröpfer. 2014. Automatic Protocol Selection in Secure Two-Party Computations. In ACNS 2014. 566–584.
[64] Hyeong-Jin Kim and Jae-Woo Chang. 2018. A Privacy-Preserving k-Means Clustering Algorithm Using Secure Comparison Protocol and Density-Based Center Point Selection. In 11th IEEE International Conference on Cloud Computing, CLOUD 2018. IEEE Computer Society, 928–931. https://doi.org/10.1109/CLOUD.2018.00138
[65] Vladimir Kolesnikov, Ahmad-Reza Sadeghi, and Thomas Schneider. 2009. Improved Garbled Circuit Building Blocks and Applications to Auctions and Computing Minima. In CANS 2009. 1–20.
[66] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Process. Mag. 37, 3 (2020), 50–60. https://doi.org/10.1109/MSP.2020.2975749
[67] Yi Li, Yitao Duan, and Wei Xu. 2018. PrivPy: Enabling Scalable and General Privacy-Preserving Computation. CoRR abs/1801.10117 (2018). arXiv:1801.10117 http://arxiv.org/abs/1801.10117
[68] Minlei Liao, Yunfeng Li, Farid Kianifard, Engels Obi, and Stephen Arcona. 2016. Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrology 17 (2016). Issue 25.
[69] Yehuda Lindell and Benny Pinkas. 2009. A Proof of Security of Yao's Protocol for Two-Party Computation. J. Cryptology 22, 2 (2009), 161–188. https://doi.org/10.1007/s00145-008-9036-8
[70] Y. Lindell and B. Pinkas. 2000. Privacy Preserving Data Mining. In Proc. Advances in Cryptology - CRYPTO. Springer-Verlag.
[71] Bo Liu, Ming Ding, Sina Shaham, Wenny Rahayu, Farhad Farokhi, and Zihuai Lin. 2021. When Machine Learning Meets Privacy: A Survey and Outlook. ACM Comput. Surv. 54, 2, Article 31 (March 2021), 36 pages. https://doi.org/10.1145/3436755
[72] Jian Liu, Mika Juuti, Yao Lu, and N. Asokan. 2017. Oblivious Neural Network Predictions via MiniONN Transformations. In ACM SIGSAC CCS. 619–631. https://doi.org/10.1145/3133956.3134056
[73] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press.
[74] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. [n.d.]. Exploiting Unintended Feature Leakage in Collaborative Learning. In 2019 IEEE Symposium on Security and Privacy, SP 2019. 691–706. https://doi.org/10.1109/SP.2019.00029
[75] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. [n.d.]. Delphi: A Cryptographic Inference Service for Neural Networks. In 29th USENIX Security Symposium, USENIX Security 2020. 2505–2522. https://www.usenix.org/conference/usenixsecurity20/presentation/mishra
[76] Payman Mohassel, Mike Rosulek, and Ni Trieu. 2020. Practical Privacy-Preserving K-means Clustering. Proc. Priv. Enhancing Technol. 2020, 4 (2020), 414–433. https://doi.org/10.2478/popets-2020-0080
[77] Payman Mohassel and Yupeng Zhang. 2017. SecureML: A System for Scalable Privacy-Preserving Machine Learning. In IEEE Security and Privacy 2017. 19–38. https://doi.org/10.1109/SP.2017.12
[78] Fionn Murtagh and Pedro Contreras. 2017. Algorithms for hierarchical clustering: an overview, II. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 7, 6 (2017). https://doi.org/10.1002/widm.1219
[79] Terry Nelms, Roberto Perdisci, and Mustaque Ahamad. 2013. ExecScent: Mining for New Domains in Live Networks with Adaptive Control Protocol Templates. In Proceedings of the 22nd USENIX Security Symposium.
[80] Sophia R. Newcomer, John F. Steiner, and Elizabeth A. Bayliss. 2011. Identifying Subgroups of Complex Patients With Cluster Analysis. American Journal of Managed Care 17 (2011), 324–332. Issue 8.
[81] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. 2013. Privacy-Preserving Ridge Regression on Hundreds of Millions of Records. In Proc. IEEE Symposium on Security and Privacy (S&P). IEEE.
[82] Olga Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, and Manuel Costa. [n.d.]. Oblivious Multi-Party Machine Learning on Trusted Processors. In 25th USENIX Security Symposium, USENIX Security 16. 619–636.
[83] Stanley R. M. Oliveira and Osmar R. Zaïane. 2003. Privacy Preserving Clustering by Data Transformation. In XVIII Simpósio Brasileiro de Bancos de Dados, Anais/Proceedings. 304–318.
[84] Clark F. Olson. 1995. Parallel Algorithms for Hierarchical Clustering. Parallel Comput. 21, 8 (1995), 1313–1325. https://doi.org/10.1016/0167-8191(95)00017-I
[85] Claudio Orlandi, Alessandro Piva, and Mauro Barni. 2007. Oblivious Neural Network Computing via Homomorphic Encryption. EURASIP J. Information Security 2007 (2007). https://doi.org/10.1155/2007/37343
[86] P. Paillier. 1999. Public-key cryptosystems based on composite degree residuosity classes. In Proc. Advances in Cryptology - EUROCRYPT. Springer-Verlag.
[87] Martin Pettai and Peeter Laud. 2015. Combining Differential Privacy and Secure Multiparty Computation. In Proceedings of the 31st Annual Computer Security Applications Conference, Los Angeles, CA, USA, December 7-11, 2015. ACM, 421–430.
[88] Michael O. Rabin. 1981. How to exchange secrets by oblivious transfer. Technical Report TR-81, Aiken Computation Laboratory, Harvard University.
[89] Fang-Yu Rao, Bharath K. Samanthula, Elisa Bertino, Xun Yi, and Dongxi Liu. 2015. Privacy-Preserving and Outsourced Multi-user K-Means Clustering. In IEEE Conference on Collaboration and Internet Computing, CIC 2015, Hangzhou, China, October 27-30, 2015. IEEE Computer Society, 80–89. https://doi.org/10.1109/CIC.2015.20
[90] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. [n.d.]. Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications. In AsiaCCS 2018. 707–721. https://doi.org/10.1145/3196494.3196522
[91] R. L. Rivest, L. Adleman, and M. L. Dertouzos. 1978. On Data Banks and Privacy Homomorphisms. Foundations of Secure Computation, Academia Press (1978), 169–179.
[92] Bita Darvish Rouhani, M. Sadegh Riazi, and Farinaz Koushanfar. 2018. Deepsecure: scalable provably-secure deep learning. In DAC 2018. ACM, 2:1–2:6. https://doi.org/10.1145/3195970.3196023
[93] Ahmad-Reza Sadeghi and Thomas Schneider. 2008. Generalized Universal Circuits for Secure Evaluation of Private Functions with Application to Data Classification. In ICISC 2008. 336–353. https://doi.org/10.1007/978-3-642-00730-9_21
[94] Ashish P. Sanil, Alan F. Karr, Xiaodong Lin, and Jerome P. Reiter. 2004. Privacy preserving regression modelling via distributed computation. In ACM SIGKDD 2004. 677–682.
[95] Mina Sheikhalishahi, Mona Hamidi, and Fabio Martinelli. [n.d.]. Privacy Preserving Collaborative Agglomerative Hierarchical Clustering Construction. In Information Systems Security and Privacy - 4th International Conference, ICISSP 2018, Vol. 977. 261–280. https://doi.org/10.1007/978-3-030-25109-3_14
[96] Mina Sheikhalishahi and Fabio Martinelli. 2017. Privacy preserving clustering over horizontal and vertical partitioned data. In IEEE ISCC 2017. 1237–1244. https://doi.org/10.1109/ISCC.2017.8024694
[97] Reza Shokri and Vitaly Shmatikov. 2015. Privacy-Preserving Deep Learning. In ACM SIGSAC CCS 2015. 1310–1321.
[98] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership Inference Attacks Against Machine Learning Models. In 2017 IEEE Symposium on Security and Privacy. 3–18.
[99] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. 2013. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing 2013. 245–248. https://doi.org/10.1109/GlobalSIP.2013.6736861
[100] Ion Stoica, Dawn Song, Raluca Ada Popa, David A. Patterson, Michael W. Mahoney, Randy H. Katz, Anthony D. Joseph, Michael I. Jordan, Joseph M. Hellerstein, Joseph E. Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, and Pieter Abbeel. 2017. A Berkeley View of Systems Challenges for AI. CoRR abs/1712.05855 (2017). arXiv:1712.05855 http://arxiv.org/abs/1712.05855
[101] Chunhua Su, Feng Bao, Jianying Zhou, Tsuyoshi Takagi, and Kouichi Sakurai. 2007. Privacy-Preserving Two-Party K-Means Clustering via Secure Approximation. In AINA 2007. 385–391.
[102] Chunhua Su, Jianying Zhou, Feng Bao, Tsuyoshi Takagi, and Kouichi Sakurai. 2014. Collaborative agglomerative document clustering with limited information disclosure. Security and Communication Networks 7, 6 (2014), 964–978. https://doi.org/10.1002/sec.811
[103] Toshiyuki Takada, Hiroyuki Hanada, Yoshiji Yamada, Jun Sakuma, and Ichiro Takeuchi. 2016. Secure Approximation Guarantee for Cryptographically Private Empirical Risk Minimization. In ACML 2016. 126–141. http://jmlr.org/proceedings/papers/v63/takada48.html
[104] Harry Chandra Tanuwidjaja, Rakyong Choi, Seunggeun Baek, and Kwangjo Kim. 2020. Privacy-Preserving Deep Learning on Machine Learning as a Service - a Comprehensive Survey. IEEE Access 8 (2020), 167425–167447.
[105] Florian Tramèr and Dan Boneh. [n.d.]. Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware. In 7th International Conference on Learning Representations, ICLR 2019. https://openreview.net/forum?id=rJVorjCcKQ
[106] A. Tripathy and I. De. 2013. Privacy Preserving Two-Party Hierarchical Clustering Over Vertically Partitioned Dataset. Journal of Software Engineering and Applications 06 (2013), 26–31.
[107] Jaideep Vaidya and Chris Clifton. 2003. Privacy-preserving k-means clustering over vertically partitioned data. In ACM SIGKDD 2003. 206–215.
[108] Jaideep Vaidya, Hwanjo Yu, and Xiaoqian Jiang. 2008. Privacy-preserving SVM classification. Knowl. Inf. Syst. 14, 2 (2008), 161–178. https://doi.org/10.1007/s10115-007-0073-7
[109] Xiao Shaun Wang, Yan Huang, Yongan Zhao, Haixu Tang, XiaoFeng Wang, and Diyue Bu. 2015. Efficient Genome-Wide, Privacy-Preserving Similar Patient Query Based on Private Edit Distance. In ACM CCS 2015. ACM, 492–503.

In our example, C_푓 corresponds to a circuit of two AND gates 퐴, 퐵 and an OR gate 퐶, shown in Figure 9: inputs 푥1, 푥2 are 11 and 01, and output 푓(푥1, 푥2) is 1, computed by feeding to the OR gate the two bitwise ANDs of the inputs.

To garble C_푓, P1 first maps (the two possible bits 0, 1 of) each wire 푋 in C_푓 to two random values 푤^0_푋, 푤^1_푋 (from a large domain, e.g., {0, 1}^128), called the garbled values of 푋. Specifically, P1 maps the output wires of gates 퐴, 퐵, and 퐶 to random garbled values {푤^0_퐴, 푤^1_퐴}, {푤^0_퐵, 푤^1_퐵} and, respectively, {푤^0_퐶, 푤^1_퐶}, and also maps the two input wires of gate 퐴 (respectively, gate 퐵) to random garbled values {푤^0_11, 푤^1_11}, {푤^0_21, 푤^1_21} (respectively,
https://doi.org/ { 0 1 } { 0 1 } 10.1145/2810103.2813725 푤12,푤12 , 푤22,푤22 ), where mnemonically the 푖-th input bit of [110] M.R. Weir, E.W. Maibach, G.L. Bakris, H.R. Black, P. Chawla, F.H. Messerli, J.M. party P푗 corresponds to the 푖 푗-wire, 푖, 푗 ∈ {1, 2}. Neutel, and M.A. Weber. 2000. Implications of a health lifestyle and medication Next, P1 sends to P2 the garbled truth table of every Boolean analysis for improving hypertension control. Archives of Internal Medicine 160 (2000), 481–490. Issue 4. gate in C푓 , which is the permuted encrypted truth table of the gate, [111] Wei Xie, Yang Wang, Steven M. Boker, and Donald E. Brown. 2016. PrivLogit: Effi- where row in the truth table is appropriately encrypted using the cient Privacy-preserving Logistic Regression by Tailoring Numerical Optimizers. CoRR abs/1611.01170 (2016). arXiv:1611.01170 http://arxiv.org/abs/1611.01170 garbled values of its three associated wires. We only specify the [112] Hongyang Yan, Li Hu, Xiaoyu Xiang, Zheli Liu, and Xu Yuan. 2021. PPCL: garbled truth table of the AND gate 퐴, as other gates can be handled Privacy-preserving collaborative learning for mitigating indirect information similarly. The row (1, 1) → 1 in the truth table of 퐴 dictates that the leakage. Inf. Sci. 548 (2021), 423–437. https://doi.org/10.1016/j.ins.2020.09.064 [113] Andrew Chi-Chih Yao. 1982. Protocols for Secure Computations (Extended output is 1 when input is 1, 1 or, using garbled values, that the output 1 1 1 Abstract). In 23rd Annual Symposium on Foundations of Computer Science, 1982. is 푤퐴 when input is 푤11,푤21. Accordingly, using a semantically- 160–164. https://doi.org/10.1109/SFCS.1982.38 secure symmetric encryption scheme 퐸푘 (·) (e.g., 128-bit AES), P1 [114] Andrew Chi-Chih Yao. 1986. How to Generate and Exchange Secrets (Extended 1 Abstract). In 27th Annual Symposium on Foundations of Computer Science, 1986. can express this condition as ciphertext 퐸 1 (퐸 1 (푤 )), where the 푤11 푤21 퐴 162–167. 
https://doi.org/10.1109/SFCS.1986.25 푤1 successively 푤1 ,푤1 [115] Samee Zahur and David Evans. 2013. Circuit Structures for Improving Efficiency output 퐴 is encrypted using the inputs 11 21 as of Security and Privacy Tools. In 2013 IEEE Symposium on Security and Privacy, encryption keys. P1 produces a similar ciphetext for each other row SP 2013, Berkeley, CA, USA, May 19-22, 2013. IEEE Computer Society, 493–507. in the truth table of 퐴 and sends them to P2, permuted to hide the https://doi.org/10.1109/SP.2013.40 1 [116] Qingchen Zhang, Laurence T. Yang, Zhikui Chen, and Peng Li. 2017. PPHOPCM: order of the rows. Observe that one can retrieve 푤퐴 if and only if Privacy-preserving High-order Possibilistic c-Means Algorithm for Big Data 1 1 1 1 they possess both 푤11,푤21, and that if one possesses only 푤11,푤21, Clustering with Cloud Computing. IEEE Transactions on Big Data (2017), 1–1. 1 https://doi.org/10.1109/TBDATA.2017.2701816 all other entries of the garbled truth table of 퐴 (besides 푤퐴) are [117] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An Efficient indistinguishable from random, due to the semantic security of the Data Clustering Method for Very Large Databases. In ACM SIGMOD 1996. 103– encryption scheme 퐸 (·). 114. 푘 [118] Lingchen Zhao, Qian Wang, Qin Zou, Yan Zhang, and Yanjiao Chen. 2020. Finally, to allow P2 to retrieve the final output 푓 (푥1, 푥2), P1 also Privacy-Preserving Collaborative Deep Learning With Unreliable Participants. sends the garbled values 푤0 ,푤1 of the output wire together with IEEE Trans. Inf. Forensics Secur. 15 (2020), 1486–1500. https://doi.org/10.1109/ 퐶 퐶 TIFS.2019.2939713 their corresponding mapping to 0 and 1. Note that P2 is no privy [119] Qi Zhao, Chuan Zhao, Shujie Cui, Shan Jing, and Zhenxiang Chen. 2020. Pri- to any other mappings between wires’ garbled values and their vateDL PrivateDL : Privacy-preserving collaborative deep learning against possible bit values. leakage from gradient sharing. Int. J. 
[120] Jan Henrik Ziegeldorf, Jens Hiller, Martin Henze, Hanno Wirtz, and Klaus Wehrle. [n.d.]. Bandwidth-Optimized Secure Two-Party Computation of Minima. In CANS 2015. 197–213.

A GARBLED CIRCUITS

Garbled circuits (GC) [113, 114] provide a general framework for securely realizing two-party computation of any functionality. The framework has been thoroughly studied in the literature (e.g., see formal treatments of the topic [14, 69]) and we here overview the specific procedures involved in it.

In our running example, parties P1 and P2 wish to evaluate a specific function f over their respective inputs x_1, x_2 and engage in an interactive 2-phase protocol, where one party plays the role of the garbler and the other the role of the evaluator. Without loss of generality, P1 is the garbler and P2 is the evaluator, and their interaction proceeds as follows.

In phase I, P1 expresses f as a Boolean circuit C_f, i.e., as a directed acyclic graph of Boolean AND and OR gates, and then sends a "garbled," i.e., encrypted, version of C_f to P2. In our example, C_f corresponds to a circuit of two AND gates A, B and an OR gate C, shown in Figure 9: inputs x_1, x_2 are 11 and 01, and output f(x_1, x_2) is 1, computed by feeding to the OR gate the two bitwise ANDs of the inputs.

To garble C_f, P1 first maps (the two possible bits 0, 1 of) each wire X in C_f to two random values w^0_X, w^1_X (from a large domain, e.g., {0,1}^128), called the garbled values of X. Specifically, P1 maps the output wires of gates A, B, and C to random garbled values {w^0_A, w^1_A}, {w^0_B, w^1_B} and respectively {w^0_C, w^1_C}, and also maps the two input wires of gate A (respectively, gate B) to random garbled values {w^0_11, w^1_11}, {w^0_21, w^1_21} (respectively, {w^0_12, w^1_12}, {w^0_22, w^1_22}), where mnemonically the i-th input bit of party P_j corresponds to the ij-wire, i, j ∈ {1, 2}.

Next, P1 sends to P2 the garbled truth table of every Boolean gate in C_f, which is the permuted encrypted truth table of the gate, where each row in the truth table is appropriately encrypted using the garbled values of its three associated wires. We only specify the garbled truth table of the AND gate A, as other gates can be handled similarly. The row (1, 1) → 1 in the truth table of A dictates that the output is 1 when the input is 1, 1 or, using garbled values, that the output is w^1_A when the input is w^1_11, w^1_21. Accordingly, using a semantically-secure symmetric encryption scheme E_k(·) (e.g., 128-bit AES), P1 can express this condition as ciphertext E_{w^1_11}(E_{w^1_21}(w^1_A)), where the output w^1_A is successively encrypted using the inputs w^1_11, w^1_21 as encryption keys. P1 produces a similar ciphertext for each other row in the truth table of A and sends them to P2, permuted to hide the order of the rows. Observe that one can retrieve w^1_A if and only if they possess both w^1_11, w^1_21, and that if one possesses only w^1_11, w^1_21, all other entries of the garbled truth table of A (besides w^1_A) are indistinguishable from random, due to the semantic security of the encryption scheme E_k(·).

Finally, to allow P2 to retrieve the final output f(x_1, x_2), P1 also sends the garbled values w^0_C, w^1_C of the output wire together with their corresponding mapping to 0 and 1. Note that P2 is not privy to any other mappings between wires' garbled values and their possible bit values.

In phase II, P2 evaluates the entire circuit C_f over the received garbled truth tables of the gates in it, by evaluating gates one by one in the ordering hierarchy induced by (the DAG structure of) C_f. Indeed, if P2 knows the w value of each input wire of a gate and its garbled truth table, P2 can easily discover its output value, by attempting to decrypt all rows in the table and accepting only the one that returns a correct output value.⁴ For example, if P2 has w^1_11, w^1_21, P2 can try to decrypt every value in the garbled truth table of A, until P2 finds the correct value w^1_A.

To initiate this circuit-evaluation process, P2 needs to learn the garbled values of each of the input wires in C_f, which is achieved as follows: (1) P1 sends to P2 the w values w^1_11, w^1_12 corresponding to the input wires of P1 in the clear (note that since these are random values, P2 cannot map them to 0 or 1, thus P1's input is protected); (2) P2 privately queries from P1 the w values w^0_21, w^1_22 corresponding to the input wires of P2, that is, without P1 learning which garbled values were queried, via a two-party secure computation protocol called 1-out-of-2 oblivious transfer (OT) [88]. At a very high level, and focusing on a single input bit, OT allows P2 to retrieve from P1 exactly one value in the pair (w^0_21, w^1_21) without P1 learning which value was retrieved. After running the OT protocol for every input bit, P2 can evaluate C_f, as above, to finally compute and send back to P1 the correct output f(x_1, x_2) = 1, deduced by the final garbled value w^1_C of the output wire.

Figure 9: Garbled circuit C_f of a specific function f that computes the OR over the pairwise ANDs of the 2-bit inputs. (The figure depicts the garbled truth tables of gates A, B, C, the output table, and the garbled inputs delivered in the clear and via OT.)

⁴ For this, we need to assume that the encryption scheme allows detection of well-formed decryptions, i.e., it is possible to deduce whether the retrieved plaintext has a correct format. This can be easily achieved using a blockcipher and padding with a sufficient number of 0's, in which case well-formed decryptions will have a long suffix of 0's and decryptions under the wrong key will have a suffix of random bits. This property is referred to as verifiable range in [69].
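The garbling and evaluation procedure described above can be sketched in a few lines of code. The following toy illustration is ours, not the paper's implementation: it stands in for the semantically-secure cipher E_k(·) with a hash-derived one-time pad, and uses a zero-byte suffix for the well-formed-decryption check (the "verifiable range" property of footnote 4). It garbles a single AND gate and shows that an evaluator holding one garbled value per input wire recovers exactly one output key.

```python
import hashlib
import secrets

KEYLEN = 16  # toy 128-bit wire keys

def enc(k1: bytes, k2: bytes, msg: bytes) -> bytes:
    # Double "encryption" under both wire keys: a hash-derived one-time
    # pad standing in for E_{k1}(E_{k2}(msg)) (e.g., 128-bit AES).
    pad = hashlib.sha256(k1 + k2).digest()
    padded = msg + b"\x00" * KEYLEN  # zero suffix gives the verifiable range
    return bytes(a ^ b for a, b in zip(padded, pad))

def dec(k1: bytes, k2: bytes, ct: bytes):
    pad = hashlib.sha256(k1 + k2).digest()
    pt = bytes(a ^ b for a, b in zip(ct, pad))
    # Only a decryption under the correct key pair ends in KEYLEN zero bytes.
    return pt[:KEYLEN] if pt[KEYLEN:] == b"\x00" * KEYLEN else None

# Garble one AND gate: two random keys per wire, one for bit 0 and one for bit 1.
wx, wy, wz = ({b: secrets.token_bytes(KEYLEN) for b in (0, 1)} for _ in range(3))
table = [enc(wx[a], wy[b], wz[a & b]) for a in (0, 1) for b in (0, 1)]
secrets.SystemRandom().shuffle(table)  # permute rows to hide their order

# An evaluator holding wx[1], wy[1] recovers exactly wz[1] and nothing else.
hits = [m for m in (dec(wx[1], wy[1], row) for row in table) if m is not None]
assert hits == [wz[1]]
```

With overwhelming probability a row decrypted under the wrong key pair fails the zero-suffix check, which is exactly why the evaluator can safely try all four permuted rows.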
B SECURE min-SELECTION PROTOCOLS

Here, we overview the design of GC-based protocols ArgMin and MaxDist/MinDist for secure selection of min/max values, or their index/location, over secret-shared data. These protocols have been defined in Sections 4 and 5 and comprise integral components of our solutions. We provide the exact two circuits over which we can directly apply the garbled-circuits framework (see Appendix A) to get GC-protocols ArgMin and MinDist, noting that the circuit in support of MaxDist is similar to the case of MinDist.

Recall that data consists of λ-bit values and is secret shared among the two parties as κ-bit random blinding terms, κ > λ, and (κ+1)-bit blinded values, each resulting from adding a random blinding term to an ordinary data value.

Our circuits use as building blocks the following gates, efficient implementations of which are well studied [65]:
• ADD/SUB that adds/subtracts (κ+1)-bit integers;
• MIN/MAX that selects the min/max of two λ-bit integers, using a one-bit output to encode which input value is the min/max value (e.g., on input 3, 5 MIN outputs 0 to indicate that the first value is smaller);
• a multiplexer gate MUX_i that, on input two i-bit inputs and a selector bit s, outputs the first or the second one, depending on the value of s; and finally
• hard-coded in the circuit constant gates CON_i, 1 ≤ i ≤ n, that always output the (log n)-bit fixed value i (e.g., CON_3 outputs the binary representation of 3).

Figure 10 shows the circuit of protocol ArgMin for selecting the index of the minimum value in an array of n different values. On input n (κ+1)-bit values v_1, . . . , v_n and n κ-bit blinding terms r_1, . . . , r_n, the circuit first uses n SUB gates to compute (the secret) values v_i − r_i, i = 1, . . . , n, and then selects the index of the minimum such value in n − 1 successive comparisons as follows. In the i-th comparison, a MIN gate compares the currently minimum value m_i of index loc_i (initially, m_1 = v_1 − r_1, loc_1 = 1) to value v_{i+1} − r_{i+1} of index i + 1, and its output bit is fed, as the selector bit, to two multiplexer gates MUX_μ:
• μ = log n: once for selecting among two (log n)-bit indices loc_i and i + 1, the latter conveniently encoded as the output of constant gate CON_{i+1} (such hard-coded indices significantly facilitate their propagation in the circuit, compared to the alternative of handling indices as input and carrying them over throughout the circuit); and
• μ = λ: once for selecting among two λ-bit values m_i and v_{i+1} − r_{i+1},
overall propagating the updated minimum value m_{i+1} = min{m_i, v_{i+1} − r_{i+1}} and its index loc_{i+1} to the next (i+1)-th comparison. The final output (see arrow wire) corresponds to the output of the (n−1)-th index-selection multiplexer gate.

Figure 10: The circuit for protocol ArgMin.

Figure 11 shows the circuit for protocol MinDist for selecting and re-blinding the minimum value among two secret-shared values. On input two (κ+1)-bit blinded values u, v and three κ-bit blinding terms r_1, r_2, r′, the circuit first computes u − r_1, v − r_2 using two SUB gates, then computes the minimum of these two values using a MIN gate, whose output bit is fed, as the selector bit, to a multiplexer gate MUX_λ for selecting the minimum among the two λ-bit values u − r_1 and v − r_2, which becomes the final output (see arrow wire) after it is blinded by adding the input blinding term r′ through an ADD gate. (The circuit for protocol MaxDist is the same with a MAX gate replacing the MIN gate.)

Figure 11: The circuit for protocol MinDist.
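To make the dataflow of the ArgMin circuit concrete, here is a cleartext sketch (ours, purely illustrative): it mirrors the SUB / MIN / MUX / CON stages gate by gate, but in the actual protocol this logic is evaluated inside a garbled circuit, so the intermediate values below are never exposed to either party.

```python
# Cleartext sketch of the ArgMin circuit's dataflow (illustrative only;
# in the protocol this runs gate-by-gate under garbling).

def argmin_circuit(blinded, blinding):
    """blinded[i] = value_i + blinding[i]; returns the 1-based index of
    the minimum underlying value, as the circuit's final MUX output."""
    # n SUB gates: recover the secret values v_i - r_i.
    secrets_ = [v - r for v, r in zip(blinded, blinding)]
    # n - 1 MIN/MUX stages: propagate the running minimum and its index.
    m, loc = secrets_[0], 1          # m_1 = v_1 - r_1, loc_1 = 1
    for i, s in enumerate(secrets_[1:], start=2):
        sel = s < m                  # MIN gate's one-bit output
        m = s if sel else m          # MUX over the lambda-bit values
        loc = i if sel else loc      # MUX over the log(n)-bit indices (CON_i)
    return loc

# Example: values 7, 3, 9 blinded with terms 100, 200, 300.
print(argmin_circuit([107, 203, 309], [100, 200, 300]))  # -> 2
```

Note how the hard-coded CON_i indices appear here as the loop counter i: the circuit never has to carry indices as inputs, matching the design choice discussed above.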
C PROOF OF THEOREM 5.1

We begin by recalling that, under the assumption that the oblivious transfer protocol used is secure, there exists a simulator Sim_OT that can simulate the views of each of the parties P1, P2 during a single oblivious transfer execution when given as input the corresponding party's input (and output, in case it is non-empty) and randomness.

The core idea behind our proof is that, since all values seen by the two parties during the protocol execution (apart from the indexes of the merged clusters at each clustering round) are "blinded" by large random factors, these values can be perfectly simulated, as needed in our proof, by randomly selected values. For example, assuming all values p_i, q_i are 32-bit and the chosen random values are 100-bit, it follows that the sum of the two is statistically indistinguishable from a 100-bit value chosen uniformly at random. In particular, this allows the simulator to effectively run the protocol with the adversary by simply choosing simulated values for the other party, which he chooses himself at random (in the above example these would be random 32-bit values).

We handle the two cases of party corruption separately.

Corruption of P2. The view of P2 during the protocol execution consists of:
(1) Encrypted matrices B, R and encrypted arrays L, S.
(2) For each clustering round ℓ, messages received during the oblivious transfer execution for ArgMin, denoted by OT_ℓ, and the min/max index α_ℓ.
(3) During each clustering round ℓ, for each execution of MinDist/MaxDist for index k, messages received during the corresponding oblivious transfer execution, denoted by OT_{ℓ,k}, the corresponding garbled circuit GC_{ℓ,k}, and output value v_{ℓ,k}.
(4) Encrypted cluster representative values E_1, . . . , E_{ℓ_t}.

The simulator Sim_{P2}, on input the random tape R_2, points q_1, . . . , q_{n_2}, and outputs (rep_1/|J_1|, |J_1|, . . . , rep_{ℓ_t}/|J_{ℓ_t}|, |J_{ℓ_t}|), α_1, . . . , α_{ℓ_t}, computes the view of P2 as follows.

• (Ciphertext computation) Using random tape R_2, the simulator runs the key generation algorithm for P2 to receive sk′, pk′. He then chooses values p′_1, . . . , p′_{n_1} uniformly at random from {0,1}^d. These will act as the "simulated" values for player P1. He then runs protocol PHC honestly using the values p′_i as input for P1 (and the actual values q_i of P2), with the following modifications.

• (Oblivious transfer simulation for OT_ℓ) For ℓ = 1, . . . , ℓ_t, let W_ℓ be the set of garbled input values computed by P2 for the garbled circuit that evaluates MinDist/MaxDist at clustering round ℓ. Since we are in the semi-honest setting, the corrupted P2 computes these values uniformly at random. Therefore, the simulator can also compute them using R_2. Then, for each ℓ, the simulator includes in the view (instead of OT_ℓ) the output OT′_ℓ produced by simulator Sim^(2)_OT on input W_ℓ (and corresponding randomness derived from R_2). Note that P2 does not receive any output from this oblivious transfer execution, thus Sim^(2)_OT only works given the input.

• (Oblivious transfer simulation for ArgMin) For each clustering round ℓ, the simulator includes in the view the index α_ℓ.

• (Garbled circuit simulation for GC_{ℓ,k}) Next, the simulator needs to compute the garbled circuits GC_{ℓ,k}. The simulator uses the corresponding values from R (as computed so far) and a "new" blinding factor ρ_{ℓ,k} for P1's inputs and computes a garbled circuit for evaluating ArgMin honestly. The simulator also includes in the view of P2 the garbled inputs for the corresponding elements from R.

• (Oblivious transfer simulation for OT_{ℓ,k}) Let y_{ℓ,k} be the input of P2 for the circuit GC_{ℓ,k} (i.e., the execution of ArgMin for index k during clustering round ℓ). Since we are in the semi-honest case, the corrupted P2 will provide as input the values that have been established from the interaction with P1 (using the points p′_i) up to that point, therefore y_{ℓ,k} can be computed by the simulator. In order to compute the parts of the view that correspond to each of OT_{ℓ,k}, the simulator includes in the view the output of Sim_OT on input y_{ℓ,k} and the corresponding choice from each pair of garbled inputs he chose in the previous step (as dictated by the bit representation of y_{ℓ,k}), which we denote as OT′_{ℓ,k}.

• (Encrypted representatives computation) For ℓ = 1, . . . , ℓ_t, the simulator computes rep_ℓ = ⌈rep_ℓ/|J_ℓ| · |J_ℓ|⌉ and E_ℓ = [rep_ℓ], where encryption is under (the previously computed) pk.

We now argue that the view produced by our simulator is indistinguishable from the view of P2 when interacting with P1 running PHC. This is done via the following sequence of hybrids.

Hybrid 0. This is the view view^PHC_{P2}, i.e., the view of P2 when interacting with P1 running PHC for points p_i.

Hybrid 1. This is the same as Hybrid 0, but the output of GC_ℓ in view^PHC_{P2} is replaced by α_ℓ. This is indistinguishable from Hybrid 0 due to the correctness of the garbling scheme. Since we are in the semi-honest setting, both parties follow the protocol, therefore the outputs they evaluate are always α_ℓ.

Hybrid 2. This is the same as Hybrid 1, but values in B, L are computed using values p′_i. This is statistically indistinguishable from Hybrid 1 (i.e., even unbounded algorithms can only distinguish between the two with probability O(2^-κ)), since in view^PHC_{P2} each of the values in B, L is computed as the sum of a random value from {0,1}^κ and a distance between two clusters.

Hybrid 3. This is the same as Hybrid 2, but all values in R, S are replaced with encryptions of zeros. This is indistinguishable from Hybrid 2 due to the semantic security of Paillier's encryption scheme.

Hybrid 4. This is the same as Hybrid 3, but each of OT_ℓ is replaced by OT′_ℓ, computed as described above. This is indistinguishable from Hybrid 3 due to the security of the oblivious transfer protocol.

Hybrid 5. This is the same as Hybrid 4, but the garbled inputs given to P2 for GC_{ℓ,k} are chosen based on the values that have been computed using values p′_i. Since garbled inputs are chosen uniformly at random (irrespectively of the actual input values), this follows the same distribution as Hybrid 4.

Hybrid 6. This is the same as Hybrid 5, but each of OT_{ℓ,k} is replaced by the output of OT′_{ℓ,k} computed as described above. This is indistinguishable from Hybrid 5 due to the security of the oblivious transfer protocol.

Hybrid 7. This is the same as Hybrid 6, but each value E_i sent to P2 is computed as [⌈rep_i/|J_i| · |J_i|⌉] using public key pk′. This is indistinguishable from Hybrid 6 since we are in the semi-honest setting and both parties follow the protocol, therefore the outputs they evaluate are always rep_i/|J_i|.

Note that Hybrid 7 corresponds to the view produced by our simulator and Hybrid 0 to the view that P2 receives while interacting with P1 during PHC, which concludes this part of the proof.

Corruption of P1. The case where P1 is corrupted is somewhat simpler as he does not receive any outputs from the circuits GC_{ℓ,k}. The view of P1 during the protocol execution consists of:
(1) Encrypted tables D, R and encrypted arrays H, L, S.
(2) For each clustering round ℓ, a garbled circuit GC_ℓ for evaluating ArgMin, and messages received during the corresponding oblivious transfer execution, denoted by OT_ℓ.
(3) During each clustering round ℓ, for each execution of MinDist/MaxDist for index k, messages received during the corresponding oblivious transfer execution, denoted by OT_{ℓ,k}.

The simulator Sim_{P1}, on input the random tape R_1, points p_1, . . . , p_{n_1}, and outputs (rep_1/|J_1|, |J_1|, . . . , rep_{ℓ_t}/|J_{ℓ_t}|, |J_{ℓ_t}|), α_1, . . . , α_{ℓ_t}, computes the view of P1 as follows.

• (Ciphertext computation) Using random tape R_1, the simulator runs the key generation algorithm for P1 to receive sk, pk and computes a pair sk′, pk′ for himself. He computes D, H, L consisting of encryptions of zeros under pk′. Moreover, he computes R, S consisting of encryptions of values chosen uniformly at random from {0,1}^κ and encrypted under pk.

• (Garbled circuit simulation for GC_ℓ) Next, the simulator needs to provide garbled circuits for the evaluation of ArgMin for each clustering round ℓ. For this, the simulator creates a "rigged" garbled circuit GC′_ℓ that always outputs α_ℓ, irrespectively of the inputs. This is achieved by forcing all intermediate gates to always return the same garbled output and by setting the output translation table to always decode to the bit-representation of α_ℓ (this process is explained formally in [69]).

• (Oblivious transfer simulation for ArgMin) Let W^(1)_ℓ, W^(2)_ℓ be the sets of pairs of input garbled values that the simulator chooses while creating GC′_ℓ as described above (where the former corresponds to the input of P1 and the latter to the input of P2). The simulator includes in the view a random choice from each pair in W^(2)_ℓ. Moreover, he replaces the messages in the view that correspond to the execution of OT_ℓ by the output of Sim^(1)_OT on input (y_ℓ, W^(1)_ℓ), where y_ℓ is the bit description of the input of P1 for GC_ℓ (which can be computed by the simulator since he has access to p_i, R_1).

• (Oblivious transfer simulation for MinDist/MaxDist) For each GC_{ℓ,k}, let W_{ℓ,k} be the set of garbled input values computed by P1 for the garbled circuit that evaluates MinDist/MaxDist at clustering round ℓ and cluster k. Since we are in the semi-honest setting, the corrupted P1 computes these values uniformly at random. Therefore, the simulator can also compute them using random tape R_1. Then, for each ℓ, k the simulator includes in the view (instead of OT_{ℓ,k}) the output OT′_{ℓ,k} produced by simulator Sim^(1)_OT on input W_{ℓ,k} (and corresponding randomness derived from R_1). Note that P1 does not receive any output from this oblivious transfer execution, thus Sim^(1)_OT only works given the input.

We now argue that the view produced by our simulator is indistinguishable from the view of P1 when interacting with P2 running PHC. This is done via the following sequence of hybrids.

Hybrid 0. This is the view view^PHC_{P1}, i.e., the view of P1 when interacting with P2 running PHC for points q_i.

Hybrid 1. This is the same as Hybrid 0, but all values in D, H, L are replaced with encryptions of zeros. This is indistinguishable from Hybrid 0 due to the semantic security of Paillier's encryption scheme.

Hybrid 2. This is the same as Hybrid 1, but values in R, S are computed as encryptions of values chosen uniformly at random from {0,1}^κ under key pk. This is statistically indistinguishable from Hybrid 1 for the same reasons as for the case of P2 above.

Hybrid 3. This is the same as Hybrid 2, but each of GC_ℓ is replaced by GC′_ℓ, computed as described above (including the values from W^(2)_ℓ). This is indistinguishable from Hybrid 2 due to the security of the encryption scheme used for the garbling scheme (this is formally described in [69]).

Hybrid 4. This is the same as Hybrid 3, but each of OT_ℓ is replaced by OT′_ℓ, computed as described above. This is indistinguishable from Hybrid 3 due to the security of the oblivious transfer protocol.

Hybrid 5. This is the same as Hybrid 4, but each of OT_{ℓ,k} is replaced by OT′_{ℓ,k} computed as described above. This is again indistinguishable from Hybrid 4 due to the security of the oblivious transfer protocol.

Note that Hybrid 5 corresponds to the view produced by our simulator and Hybrid 0 to the view that P1 receives while interacting with P2 during PHC, which concludes the proof.
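The blinding argument at the heart of both hybrid sequences (a λ-bit value plus a fresh κ-bit random term, κ ≫ λ, is statistically close to a uniform κ-bit value) can be illustrated with a short sketch. The parameter names below are ours, chosen to match the 32-bit/100-bit example in the proof.

```python
import secrets

LAMBDA, KAPPA = 32, 100  # data and blinding bit-lengths (kappa >> lambda)

def blind(value: int) -> int:
    # Add a fresh kappa-bit blinding term, as done for the distances
    # stored in the encrypted structures B, L.
    assert 0 <= value < 2 ** LAMBDA
    return value + secrets.randbits(KAPPA)

# Any two lambda-bit inputs yield blinded values whose distributions lie
# within statistical distance about 2**(LAMBDA - KAPPA) = 2**-68, so the
# simulator may substitute a random 32-bit stand-in for the real input.
b = blind(2 ** LAMBDA - 1)
assert b < 2 ** KAPPA + 2 ** LAMBDA
```

This is exactly why Hybrid 2 in the P2 case (and Hybrid 2 in the P1 case) incurs only a statistical, not computational, distinguishing advantage.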