Privacy-Preserving Similar Patient Queries for Combined Biomedical Data
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings on Privacy Enhancing Technologies ; 2019 (1):47–67 Ahmed Salem*, Pascal Berrang, Mathias Humbert, and Michael Backes Privacy-Preserving Similar Patient Queries for Combined Biomedical Data Abstract: The decreasing costs of molecular profiling Received 2018-05-31; revised 2018-09-15; accepted 2018-09-16. have fueled the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized 1 Introduction medicine. Naturally, the increasing availability of data also enables physicians to compare patients’ data and With the plummeting costs of molecular profiling, and treatments easily and to find similar patients in order in particular whole-genome sequencing, the amount of to propose the optimal therapy. Such similar patient personal biomedical data available is rapidly increasing. queries (SPQs) are of utmost importance to medical This new data availability has positively affected the practice and will be relied upon in future health infor- research in medicine by enabling the development of re- mation exchange systems. While privacy-preserving so- search directions that were previously impossible due to lutions have been previously studied, those are limited the lack of data. Whole-genome sequencing has enabled to genomic data, ignoring the different newly available biomedical researchers to identify the relations between types of biomedical data. a plethora of severe diseases and the genomic muta- In this paper, we propose new cryptographic techniques tions responsible for them. For instance, links have been for finding similar patients in a privacy-preserving man- found between some genomic variants and the risk of de- ner with various types of biomedical data, including veloping Alzheimer’s disease [1]. However, the genome genomic, epigenomic and transcriptomic data as well is only the tip of the iceberg concerning the current as their combination. We design protocols for two of biomedical data deluge. the most common similarity metrics in biomedicine: the Researchers have shown that aberrant levels of other Euclidean distance and Pearson correlation coefficient. types of biomedical data, such as epigenomic or tran- Moreover, unlike previous approaches, we account for scriptomic data, could indicate its owner carrying a se- the fact that certain locations contribute differently to vere disease, such as diabetes, neurodegenerative dis- a given disease or phenotype by allowing to limit the eases, heart diseases or cancers [2–4]. Two of the most query to the relevant locations and to assign them dif- important epigenetic elements in the human body, DNA ferent weights. Our protocols are specifically designed methylation and microRNAs (miRNAs), have been as- to be highly efficient in terms of communication and sociated with numerous diseases as well. For example, bandwidth, requiring only one or two rounds of com- abnormal DNA methylation patterns, such as hyper- or munication and thus enabling scalable parallel queries. hypomethylation, are often observed for cancer patients, We rigorously prove our protocols to be secure based on leading to the hyper-activation of genes such as onco- cryptographic games and instantiate our technique with genes, or the silencing of tumor suppressor genes [5]. three of the most important types of biomedical data – Dysregulation of microRNA expression has also been namely DNA, microRNA expression, and DNA methy- shown to be linked to several types of cancer [6]. Finally, lation. Our experimental results show that our protocols the combination of different kinds of biomedical data is can compute a similarity query over a typical number of an auspicious and growing direction of research. For in- positions against a database of 1,000 patients in a few seconds. Finally, we propose and formalize strategies to mitigate the threat of malicious users or hospitals. *Corresponding Author: Ahmed Salem: CISPA, Saar- Keywords: Biomedical data privacy land University, E-mail: [email protected] Pascal Berrang: CISPA, Saarland University, E-mail: pas- DOI 10.2478/popets-2019-0004 [email protected] Mathias Humbert: Swiss Data Science Center, ETH Zurich and EPFL, E-mail: mathias.humbert@epfl.ch Michael Backes: CISPA Helmholtz Center i.G., E-mail: [email protected] Privacy-Preserving Similar Patient Queries for Combined Biomedical Data 48 stance, Hamed et al. study the relation between breast queries based on two similarity measures widely used in cancer and the combination of various data such as DNA the biomedical context: the Euclidean distance and the methylation, microRNA expression, and genomic vari- Pearson correlation coefficient. ants [7]. Speicher and Pfeiger combine DNA methyla- Since the secure computation of Pearson correla- tion, microRNA expression, and gene expression for in- tion coefficient involves squaring of ciphertexts, we also tegrative analysis of tumor samples [8, 9]. propose an efficient extension for additively homomor- As biomedical data are increasingly available, it be- phic encryption schemes that enables the squaring of comes easier to compare patients’ data with each other. ciphertexts. An additional requirement on the encryp- This functionality is of utmost importance to be able to tion scheme is that it also supports scalar multiplication find similar patients and, e.g., check how they responded of ciphertexts. Our extension is based on the work of to different therapies. In this context, the Global Al- Catalano and Fiore [31], adapting their idea to squar- liance for Genomics and Health (GA4GH) has devel- ing and improving the efficiency in terms of ciphertext oped the MatchMaker Exchange [10], a platform that size and decryption time. We also prove our extension facilitates rare disease gene discovery through a fed- to be secure with regard to ciphertext indistinguishabil- erated network of genotype and phenotype databases. ity under chosen plaintext attacks. We instantiate our It typically answers requests such as “do you have any protocols and the proposed extension with the Paillier patients similar to one who has hypertelorism with a cryptosystem. deleterious variant in CASQ2”, where the similarity is We consider three different settings of information defined by the receiving system. However, due to the se- released to the parties involved in our protocols, thus rious privacy concerns related to genomic data [11–16], offering some flexibility to the system designer. These but also epigenomic and transcriptomic data [17–21], we range from both queried and querying parties learning anticipate that any similarity computation will have to only whether the similarity measure is below or above be performed privately in the near future, i.e., without a predefined threshold (boolean outcome) to one of the disclosing the patients’ raw biomedical data. two parties learning the final similarity measure and the While privacy-preserving similar patient queries other learning the boolean outcome. The latter case can have been previously studied in the literature, these so- be helpful when the querying party (e.g., a physician) lutions have been practically limited to whole-genome wants to know to which extent his patient is similar queries [22–24]. In this paper, we propose new methods to patients in the queried party’s database (e.g., in a that allow us to find similar patients based on various hospital’s database). All our protocols are designed with types of biomedical data, such as genomic, epigenomic, a focus on keeping the communication overhead low, and transcriptomic data, and to combine them if nec- running in only one or two rounds of communication essary. Moreover, our approach easily allows the users depending on the chosen setting. This allows for scalable to query specific locations instead of the whole genome. parallel queries, effectively reducing the number of open This is especially important as omics-based diseases and connections and the bandwidth on the hospital’s side. therapies are always related to a limited set of positions We rigorously prove our protocols to be secure in [25–30]. Moreover, it easily allows a medical practitioner the presence of honest-but-curious user and hospital. to find similar patients carrying a certain disease by Moreover, we provide an in-depth discussion and for- querying the relevant disease markers for the chosen set malize countermeasures to prevent inferences from ma- of biomedical data types. Finally, it is generic enough licious users or hospitals. Finally, we implement and to adapt to the functionality of any health information test the different protocols using various combinations exchange platform, such as the MatchMaker Exchange. of data. These show that we can query a typical num- ber of biomedical data positions against a database of 1,000 patients in less than five seconds for the weighted Contributions Euclidean distance protocol, and in less than thirty sec- In this paper, we introduce the first privacy-preserving onds for the Pearson correlation coefficient. This demon- similar patient query system that can handle differ- strates that our approach is efficient and highly practical ent biomedical data types, while enabling the users to in current clinical and research settings. query a specific subset of positions and to put different weights on these positions or data types. Our system relies on additively homomorphic encryption and imple- ments different protocols for performing similar patient Privacy-Preserving Similar Patient Queries for Combined