
Privacy-Preserving Hierarchical Clustering: Formal Security and Efficient Approximation

Xianrui Meng, Amazon Research
Dimitrios Papadopoulos, Hong Kong University of Science and Technology
Alina Oprea, Northeastern University
Nikos Triandopoulos, Stevens Institute of Technology

ABSTRACT

Machine Learning (ML) is widely used for predictive tasks in a number of critical applications. Recently, collaborative or federated learning has emerged as a new paradigm that enables multiple parties to jointly learn ML models on their combined datasets. Yet, in most application domains, such as healthcare and security analytics, privacy risks limit entities to individually learning local models over the sensitive datasets they own. In this work, we present the first formal study of privacy-preserving collaborative hierarchical clustering, overall featuring scalable cryptographic protocols that allow two parties to privately compute joint clusters on their combined sensitive datasets. First, we provide a formal definition that balances accuracy and privacy, and we present a provably secure protocol along with an optimized version for single-linkage clustering. Second, we explore the integration of our protocol with existing approximation algorithms for hierarchical clustering, resulting in a protocol that can efficiently scale to very large datasets. Finally, we provide a prototype implementation and experimentally evaluate the feasibility and efficiency of our approach on synthetic and real datasets, with encouraging results. For example, for a dataset of one million records and 10 dimensions, our optimized privacy-preserving approximation protocol requires 35 seconds for end-to-end execution, just 896KB of communication, and achieves 97.09% accuracy.

1 INTRODUCTION

Big-data analytics is a ubiquitous practice already impacting our daily lives. Our digital interactions produce massive amounts of data which are processed with the goal of discovering unknown patterns or correlations that, in turn, can suggest useful conclusions and help make better decisions. At the core of this lies Machine Learning (ML) for devising complex data models and predictive algorithms that provide hidden insights and automated decisions while optimizing certain objectives. Example applications successfully employing ML frameworks include market-trend forecasting, service personalization, speech/face recognition, autonomous driving, health diagnostics, and security analytics.

Data analysis is, of course, only as good as the analyzed data, but this goes beyond the need to properly inspect, cleanse, or transform "high-fidelity" data prior to its modeling. In most learning domains, analyzing "big data" is of twofold semantics: volume and variety. First, the larger the dataset available to an ML algorithm, the better the learning accuracy, as irregularities due to outliers fade away faster. Thus, scalability to large inputs is very important, especially so in unsupervised learning, where model inference uses unlabelled observations. Second, the more varied the data collected for analysis, the richer the inferred information, as degradation due to noise is reduced and domain coverage is increased. Indeed, for a given learning objective, say classification or anomaly detection, combining more datasets of similar type but different origin allows for more elaborate analysis and discovery of complex hidden structures (e.g., correlation or causality). Thus, the predictive power of ML models naturally improves when they are built as global models over multiple datasets owned and contributed by different entities, in what is termed collaborative learning, widely considered the gold standard [70].

Privacy-preserving unsupervised learning. Several critical learning tasks, across a variety of different application domains (such as healthcare or security analytics), demand deriving accurate ML models over highly sensitive and/or proprietary data. By default, organizations in possession of such confidential datasets are left with no other option than simply running their own local models, severely impacting the efficacy of the learning task at hand. Thus, in the context of data analytics, privacy risks are the main impediment against collaboratively learning over large volumes of individually contributed data.

The security community has recently embraced the powerful concept of privacy-preserving collaborative learning (PCL), the premise being that effective analytics over sensitive data is feasible by building global models in ways that protect privacy. This is typically achieved by applying secure multiparty computation (MPC) or differential privacy (DP) to data analytics, so that "learning" is performed over encrypted or sanitized data. One notable example is the recently proposed privacy-preserving federated learning, in which model parameters are aggregated and shared by multiple clients to generate a global model [9]. Existing work on PCL almost exclusively addresses supervised rather than unsupervised learning tasks (with a few exceptions such as k-means clustering). As unsupervised learning is a prevalent learning paradigm, the design of supporting ML protocols that promote collaboration, accuracy, and privacy in this setting is vital.

In this paper, we present the first formal study of privacy-preserving hierarchical clustering, featuring scalable cryptographic protocols that allow two parties to privately compute joint clusters on their combined sensitive input datasets. Previous work in this space introduces several cryptographic protocols (e.g., [17, 39, 40]), but without rigorous security definitions and analysis.
Motivating applications. Hierarchical clustering is a class of unsupervised learning methods seeking to build a hierarchy of clusters over an input dataset (called a dendrogram), typically using the agglomerative (bottom-up) approach. Clusters are initialized with single points and are iteratively merged using different cluster-linkage criteria (e.g., nearest neighbor and diameter for single and complete linkage, respectively). This approach is widely used in practice, often in application areas where the need for scalable PCL solutions is profound.
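To make the merging process concrete, the following minimal Python sketch runs plain (non-private) single-linkage agglomerative clustering on cleartext points; the function name and parameters are illustrative choices of ours, not part of the paper's protocol.

```python
# Minimal, non-private single-linkage agglomerative clustering sketch.
# Illustrative only: names and parameters are ours, not the paper's protocol.
from itertools import combinations

def single_linkage(points, num_clusters):
    """Iteratively merge the two closest clusters (nearest-neighbor linkage)."""
    clusters = [[p] for p in points]   # start with one singleton cluster per point
    merges = []                        # record of merges, i.e., the dendrogram
    while len(clusters) > num_clusters:
        # Single linkage: cluster distance = distance of the closest member pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                sum((x - y) ** 2 for x, y in zip(p, q))
                for p in clusters[ij[0]] for q in clusters[ij[1]]
            ),
        )
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters, merges

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1)]
print(single_linkage(points, num_clusters=2)[0])
```

Each iteration scans all remaining cluster pairs, which is what drives the cubic cost discussed below and what a privacy-preserving protocol must reproduce over the parties' joint, hidden inputs.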
In healthcare, for instance, hierarchical clustering allows researchers, clinicians, and policy makers to process medical data in order to discover useful correlations that can improve health practices, e.g., discover similar groups of genes [23], patient groups most in need of targeted intervention [55, 79], and changes in healthcare costs as a result of specific treatments [48]. To be of any value, such analyzed medical data contains sensitive information (e.g., patient records, gene information, or PII) that must be protected (also due to legislation such as HIPAA in the US or GDPR in the EU).

Also, in security analytics, hierarchical clustering allows enterprise security staff to process network and log data in order to discover suspicious activities or malicious events, e.g., detect botnets [34], malicious traffic [54], and compromised accounts [12], and classify malware [8]. Again, security logs contain sensitive information (e.g., employee/customer data, enterprise security posture, etc.) that must be protected (due to industry regulations or for reduced liability).

As such, without privacy-protection provisions in place for collaborative learning, data owners are restricted to applying hierarchical clustering solely on their own private datasets, thus losing in accuracy and effectiveness. In contrast, our treatment of the problem as a PCL instance is a solid step towards employing the benefits of hierarchical clustering over any collection of sensitive datasets.

A number of challenges, then, relate to making the cryptographic protocols scalable to large datasets. Hierarchical clustering is already computationally heavy, with running cost O(n^3) (n being the size of the training data). Standard methods for secure two-party computation (e.g., Yao's garbled circuits [81]) result in large communication, while the use of fully homomorphic encryption [29] is still prohibitively costly, rendering the design of PCL protocols for hierarchical clustering a difficult task. Approximation algorithms for clustering, such as CURE [35], are the de facto way to achieve scalability to massive amounts of data; however, incorporating both approximation and privacy protection in secure computation is far from obvious, and has not been considered previously in the literature.
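For intuition, CURE-style approximation roughly works by clustering only a random sample, keeping a few representatives per cluster shrunk toward the cluster centroid, and assigning every remaining point to its nearest representative. The sketch below is a rough, non-private illustration of that idea using SciPy on synthetic data, with parameters chosen by us for the example; it is not the exact algorithm of [35] nor the construction of Section 5.

```python
# Rough, non-private sketch of CURE-style sample-and-assign approximation.
# Assumed parameters and data are placeholders for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 10))            # full dataset (placeholder)
k, sample_size, reps_per_cluster, shrink = 5, 2_000, 4, 0.3

# 1) Hierarchically cluster only a random sample instead of all n points.
sample = data[rng.choice(len(data), sample_size, replace=False)]
labels = fcluster(linkage(sample, method="single"), t=k, criterion="maxclust")

# 2) Keep a few far-out members per cluster as representatives and shrink
#    them toward the centroid (a simplified version of the CURE heuristic).
reps, rep_labels = [], []
for c in range(1, k + 1):
    members = sample[labels == c]
    centroid = members.mean(axis=0)
    far = members[np.argsort(cdist([centroid], members)[0])[-reps_per_cluster:]]
    reps.append(far + shrink * (centroid - far))
    rep_labels += [c] * len(far)
reps = np.vstack(reps)

# 3) Assign every point of the full dataset to its nearest representative.
assignment = np.array(rep_labels)[cdist(data, reps).argmin(axis=1)]
print(np.bincount(assignment)[1:])               # cluster sizes
```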
We address these challenges through some unique insights that enable us to design practical PCL protocols for hierarchical clustering. First, in Section 4, we devise our main constructions as mixed protocols (e.g., [18, 36, 45]). We carefully decompose the computation needed in each iteration of hierarchical clustering into building blocks that can be efficiently implemented via either garbled circuits or additively homomorphic encryption [59]. We then explore performance trade-offs in our design space and select a combination of building blocks that achieves fast computation and low bandwidth usage, wherein we employ garbled circuits for cluster merging and distance updates, but homomorphic encryption for distance computation.
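As a flavor of the additively homomorphic building block for distance computation, the sketch below uses the third-party python-paillier (`phe`) library (an assumption on our part; the paper's implementation may differ) to let party B evaluate an encrypted squared Euclidean distance under party A's public key via the identity (a_i - b_i)^2 = a_i^2 - 2*a_i*b_i + b_i^2. Blinding, secret-sharing of the result, and the interaction with the garbled-circuit components are omitted.

```python
# Sketch of two-party squared-distance evaluation under additive HE.
# Assumes the third-party python-paillier library: pip install phe
from phe import paillier

# Party A: owns point a; sends its public key plus Enc(a_i) and Enc(a_i^2).
a = [3, 7, 2]
pub, priv = paillier.generate_paillier_keypair(n_length=2048)
enc_a = [pub.encrypt(x) for x in a]
enc_a_sq = [pub.encrypt(x * x) for x in a]

# Party B: owns point b; homomorphically computes
# sum_i (a_i - b_i)^2 = sum_i (a_i^2 - 2*a_i*b_i + b_i^2) without seeing a.
b = [1, 5, 4]
enc_dist = pub.encrypt(0)
for e_ai, e_ai_sq, bi in zip(enc_a, enc_a_sq, b):
    enc_dist += e_ai_sq + e_ai * (-2 * bi) + bi * bi

# For illustration only, A decrypts the result directly; a real protocol
# would blind or secret-share it before revealing anything.
print(priv.decrypt(enc_dist))  # 12 == (3-1)^2 + (7-5)^2 + (2-4)^2
```

Ciphertext addition and plaintext-scalar multiplication are the only operations needed here, which is why an additive scheme suffices for this step while comparisons and merges are better handled by garbled circuits.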
Moreover, in Section 5, we show how our design can be further optimized by efficiently incorporating an existing clustering approximation algorithm [35] into our main framework, to achieve the best of both worlds: strong privacy guarantees and scalability. We explore several variants that exhibit different trade-offs between