
ABSTRACT

PRIVACY PRESERVING KIN GENOMIC DATA PUBLISHING

by Hui Shang

The high availability of genome sequencing data and advances in data mining stimulate biomedical breakthroughs and raise the hope of personalized medicine. Meanwhile, the popularity of personalized and commercialized genome testing brings increasing privacy concerns. Differential privacy, derived from cryptography, is a robust framework for sharing aggregate information while protecting an individual’s privacy. However, it is challenging to apply differential privacy directly to genetic privacy because of the specific characteristics of genome data: it is unique to each individual, it is Big Data, and it exhibits kin-genomic dependency. In this thesis, we construct a genome data model that captures these characteristics. We then design an attack algorithm that runs belief propagation on a factor graph. Finally, we develop a differentially private method that contributes to the privacy-preserving dissemination of genomic data.

PRIVACY PRESERVING KIN GENOMIC DATA PUBLISHING

A Thesis

Submitted to the Faculty of Miami University in partial fulfillment of the requirements for the degree of Master of Science

by Hui Shang

Miami University
Oxford, Ohio
2020

Advisor: Dr. He, Zaobo
Reader: Dr. Inclezan, Daniela
Reader: Dr. Bibak, Khodakhast

©2020 Hui Shang

This Thesis titled

PRIVACY PRESERVING KIN GENOMIC DATA PUBLISHING

by

Hui Shang

has been approved for publication by

The College of Engineering and Computing

and

The Department of Computer Science & Software Engineering

Dr. He, Zaobo

Dr. Inclezan, Daniela

Dr. Bibak, Khodakhast

Table of Contents

List of Tables

List of Figures

Dedication

Acknowledgements

1 Introduction
1.1 Motivation
1.2 Contributions and Goals
1.3 Thesis Outline
1.4 Dissemination of Results

2 Background & Related Work
2.1 Genetic Terms
2.2 Probabilistic Graph Model
2.2.1 Factor Graph
2.2.2 Belief Propagation Algorithm
2.3 Related Work
2.3.1 Inference Attacks
2.3.2 Privacy Preserving Genome Data Sharing Methods
2.4 Comparison

3 Inference Attacks On Kin Genomic Data and Their Mitigation
3.1 Introduction
3.2 Preliminaries
3.2.1 Genetics
3.2.2 Differential Privacy
3.2.3 Adversary Model
3.2.4 Problem Definition
3.3 Method
3.3.1 Notations
3.3.2 Construct of Kin Genome Data Model
3.3.3 Applying Belief Propagation
3.4 Sanitization of Kin Genomic Data
3.5 Metrics for Kin Genomic Privacy and Utility
3.5.1 Inference Error
3.5.2 Entropy
3.5.3 Utility

4 Validation
4.1 Experimental Design
4.1.1 Experiment Environment
4.1.2 Experimental Data
4.2 Results
4.2.1 Inference Attack
4.2.2 Data Sanitization

5 Conclusion
5.1 Summary of This Work
5.2 Challenge
5.3 Future Work

References

List of Tables

3.1 Probability of a child’s genotypes, given parents’ genotypes
3.2 Probability of a father’s genotypes, given his spouse and his child’s genotypes
3.3 Conditional probability distribution of phenotypes
3.4 Probability of genotypes distribution, given allele frequency
4.1 Inference attack results

List of Figures

2.1 Example of a factor graph
3.1 Genotype relationships among trio family
3.2 Workflow for kin Genomic data inference attack
3.3 Example of kin genome factor graph, representing a trio with 3 SNP and 2 traits per family member.
3.4 Belief propagation on factor graph
3.5 Example of addition of differential privacy noise
4.1 Manuel Corpas Family Tree of the four family members. Females are represented by ellipses and males are represented as squares.
4.2 CEPH/UTAH Pedigree 1463 of 11 individuals. Females are represented by ellipses and males are represented as squares. The family tree represents family relationships.
4.3 Evaluation of son privacy loss by inference error, given the genome of son’s different relatives
4.4 Evaluation of adversary’s uncertainty for prediction of son, given the genomes of son’s different relatives
4.5 Evaluation of r1 privacy loss by inference error, given the genome of r1’s different relatives
4.6 Evaluation of adversary’s uncertainty for prediction of r1, given the genomes of r1’s different relatives
4.7 Evaluation of DP noise in genetic privacy preservation by inference error
4.8 Evaluation of DP noise in genetic privacy preservation by entropy

Dedication

To my family, teachers, friends and this beautiful world.

Acknowledgements

Studying Computer Science has been a challenging but rewarding experience; it is a field I did not expect to enter a few years ago. It has given me excellent training and skills that broaden my career path. Moreover, it has offered me cherished opportunities to connect with many dedicated faculty members of the Department of Computer Science and Software Engineering (CSE), and with sincere friends who cared for, supported, and encouraged me along the way.

I would like to offer my special gratitude to Dr. He, Zaobo, my advisor. I thank him for his valuable instruction and professional guidance during the design and progress of this research project. He has been very helpful, not only as an advisor but also as a good friend. I would like to express my great appreciation to Dr. Inclezan, Daniela, and Dr. Bibak, Khodakhast, my thesis committee members, for their constructive suggestions and encouragement in my study. They are knowledgeable and professional in both research and teaching. I would also like to thank the other CSE faculty members, Dr. Eric Bachmann, Dr. Md Gani, Dr. Angel Bravo-Salgado, Dr. John Femiani, Dr. Alan Ferrenberg, Dr. Eric Rapos, Dr. Chun Liang, Dr. Karen Davis and Dr. Mike Zmuda, for their excellent lectures.

I am grateful to my friends, classmates, and those who kindly offered assistance and encouragement. I thank Liu Xian and Xiaolin Liu for encouraging me to step out of my comfort zone and begin my journey in the field of Computer Science. I thank Chitraketu Pandey for helping me implement the Belief Propagation Algorithm in Python for this research work and for the collaboration on our team course projects. I enjoyed the time we spent discussing various coding questions and algorithms. I thank Iman Deznabi for providing some crucial datasets for the experiments, and Yefei Ye for helping me with the data preprocessing. I thank Shrawani Silwa, Zehua Lin, Zunchen Zhao, Minghua Li, Li Zhang, Yanxue Xie, Janelle Allen, and Shangye Chen for their support.

I would like to give my special thanks to my family. My parents gave me life, raised me, and taught me to be kind and helpful towards others. They also taught me to be persistent and to stay optimistic about life. I want to thank my husband, who accompanies and supports me all the time. He spares no effort to make me happy. I am also grateful to my parents-in-law, who care for us. I feel blessed to have connected with many lovely people. During the hard time of the COVID-19 pandemic, I hope everyone is fine and safe. "Love and Honor", in the words of the code of Miami University.

Chapter 1 Introduction

1.1 Motivation

The development of genome sequencing technologies and the rapidly falling price of sequencing have accelerated the accumulation of genetic data. For example, the cost of sequencing a human genome decreased from over $1 million in 2007 to around $1 thousand in 2015 [1]. Meanwhile, a tremendous amount of human genome data is generated and stored. For instance, in 2017, the European Genome-phenome Archive (EGA) stored 5.85 petabytes (PB) of human genomic data, with the total storage size growing by 29.5% per year [2]. Nowadays, researchers across the world have access to terabytes of genetic data through websites supported by academic and research organizations such as the European Bioinformatics Institute (EBI) and the US National Center for Biotechnology Information (NCBI) [3]. Moreover, commercial genome service platforms such as 23andMe [4], OpenSNP [5], and PatientsLikeMe [6] allow individuals and customers to obtain their personal genetic and health information and to release their data.

Publishing and sharing human genomic data and genetic discoveries are essential for better health. From the perspective of researchers, the availability of human genome sequences stimulates team-oriented interdisciplinary projects and biomedical breakthroughs in human evolution, disease diagnostics, and therapies [7]. Big Data analytics allows researchers to reveal associations between genetic information and disease. For individuals, genome data help them learn their disease risks and their ancestry, and receive personalized medicine for better treatment [8].

One major issue with genomic data publishing is that the release of a person’s genetic information constitutes a threat to their privacy.
For example, an attacker can gain sensitive information by computing an individual’s disease susceptibility from single nucleotide polymorphisms (SNPs) in a published genetic database such as the Genome-Wide Association Studies (GWAS) Catalog [9]. In this way, the privacy of participants and even of their family members is adversely affected when their sensitive information is exposed. The leakage of sensitive personal genetic information could harm many aspects of their lives, such as financial situation, social status, and family relationships. For example, they may face discrimination from health insurers or employers if their genomic data reveal a higher risk of certain diseases. Moreover, with increasing privacy concerns and the risk of genetic information breaches, people become less likely to share their genetic data or participate in biomedical research such as GWAS. This, in turn, impedes the progress of personalized medicine and biotechnology discovery.

Nevertheless, preserving the privacy of human genome data is a challenge. One reason is that human genome data is a type of Big Data: it is large in volume and very complex. The data from sequencing a single complete human genome take about 140 gigabytes (GB) of storage space [3]. Of this, the genome sequencing data are typically around 100 GB in the form of DNA sequence alignment map files [10]. In addition to DNA sequences and their annotations, genome data for GWAS also include a broad range of side information, such as expression, medical measurements, phenotypes, diseases, physical address, and family relations. Even though human genetic data sets are de-identified (by removing unique personal identifiers) before publishing, research has demonstrated that this approach is not sufficient to protect genetic privacy [11][12]. Attackers with a solid background in data mining and genetics can infer people’s identities, attributes, and complete genome information even from partial genome data and open resources [13].

Another primary reason is that the genome data of family members are highly correlated: there is a kin-genome dependency relationship among relatives. Therefore, leaking one member’s genetic information can result in leakage of other members’ data. The issue of kin privacy in genomic data sharing was first raised in 2008 by Stajano et al. [14].

To address the conflict between genome data sharing and protecting genetic privacy, an emerging and active scientific community has made initial efforts over the past few years. However, the majority of current modeling strategies do not consider the kin-genomic dependency relationship. In this work, we study inference attacks that exploit kin-genomic dependency, and we propose a framework that adds differentially private noise to protect an individual’s genetic privacy.

1.2 Contributions and Goals

The ultimate goal of this proposed study is to establish a privacy-friendly and data fidelity-preserving framework for privacy-preserving kin genome data publishing. To achieve this goal, we make the following contributions:

• Create a probabilistic graphical model for integrating SNPs, family relationships, and complex GWAS statistics with SNPs-trait association

• Develop an inference attack algorithm on unknown SNPs or traits

• Formulate a countermeasure against inference attacks to achieve a trade-off between kin-genome publishing and privacy

1.3 Thesis Outline

This thesis is structured as follows. Chapter 2 presents the fundamental knowledge and related work needed to familiarize readers with the research topic. Chapter 3 describes the research problem and the proposed framework, including the model and algorithms used. Chapter 4 presents the evaluation and validation process and its results. Chapter 5 summarizes the findings and presents thoughts about future work.

1.4 Dissemination of Results

Portions of this thesis concerning the inference attack were published and orally presented at the 2019 IEEE 16th International Conference on Mobile Ad Hoc and Smart Systems Workshops (The First International Workshop on Machine Learning Security and Privacy: Experiences and Applications) in Monterey, CA, USA, in November 2019.

Chapter 2 Background & Related Work

This chapter briefly introduces background including important genetic terms and probabilistic model related concepts (factor graph and belief propagation), and related work. A brief comparison of related work is also presented.

2.1 Genetic Terms

• Genome A genome is the complete set of genetic information of a living organism. This information is essential for an organism to build its structure and body, and to function. It takes the form of DNA (deoxyribonucleic acid) or RNA (ribonucleic acid), which is composed of long chains of four different nucleotides. The genomes of most organisms are made of DNA, while some kinds of viruses use RNA to store their genetic information. The four kinds of nucleotides in DNA share the same five-carbon sugar and phosphate groups but have different nitrogenous bases: adenine (A), cytosine (C), guanine (G), and thymine (T) [15]. A pairs with T and G pairs with C to form base pairs, leading to the stable double-helix structure of DNA [15]. Therefore, DNA can be represented as a string over the alphabet {A, T, G, C}.

• Human Genome The human genome is encoded by double-stranded DNA, organized into 23 pairs of chromosomes inside the cell nucleus. Human germ cells contain haploid genomes of more than three billion base pairs, while somatic cells have diploid genomes, which contain two sets of genomic DNA. A single human genome includes 20,000 to 25,000 genes in total. All human beings share 99.9 percent of their genomic DNA [16].

• Allele Many genes have various forms, which are located at the same position (genetic locus). Traditionally, an allele is defined as one of the variant forms of a gene [17]. Except for germ cells, human cells carry two alleles of each gene, inherited from the two parents, respectively. Therefore, genomic data are highly correlated among family members. Each pair of alleles determines one’s genotype, which is fundamental to one’s phenotype. For example, if B is the dominant (major) allele and b is the recessive (minor) allele, an individual’s pair of alleles takes values in {BB, Bb, bb}. An individual with genotype BB or Bb expresses the same dominant phenotype. By contrast, the recessive phenotype is expressed only when both alleles are recessive, as in bb.

• Single Nucleotide Polymorphism (SNP) A single nucleotide polymorphism (SNP) is defined as a variation of a nucleotide at a specific location in the genome [18]. Each SNP refers to a different nucleotide at the same genetic locus. For example, the substitution of the nucleotide thymine (T) by cytosine (C) is a SNP. A common SNP is one that occurs in more than one percent of a population. SNPs are the primary source of differences in alleles, genotypes, and phenotypes. At a SNP position, the allele carried with the highest frequency in a population is termed the major allele, designated B, whereas the second most common allele in the population is termed the minor allele, designated b; its frequency is called the Minor Allele Frequency (MAF) [19].

Most SNPs do not have an obvious influence on human health and life. However, some SNPs are found to affect people’s susceptibility to environmental factors, response to medicine, development of certain diseases, or phenotypic characteristics (traits). SNP alleles that are associated with a high risk of certain diseases are termed risk alleles, and their frequency in a population is known as the Risk Allele Frequency (RAF) [20]. Analyzing the SNPs in an individual’s genome can therefore provide an evaluation of the genetic predisposition to form a trait or develop a disease. For example, people carrying two SNPs (rs429358 and rs7412) within the Apolipoprotein E gene have a higher probability of developing Alzheimer’s disease [21]. Therefore, SNPs are sensitive information from the perspective of genetic privacy.

• Linkage Disequilibrium (LD) Linkage disequilibrium is defined as the correlation (non-random association) between pairs of alleles or SNPs at different genetic loci in the whole genome in a population [22].

If there is strong LD between two SNPs, the value of one SNP can be used to infer the value of the other. Thus, highly correlated DNA sequences, SNPs, and family genomes result in interdependent privacy risks.

• Mendelian Inheritance Mendel’s “First Law”, also known as the “Law of Segregation of genes”, states that every diploid organism (one with two sets of its genome) has, for each trait, a pair of alleles located on a pair of chromosomes. These alleles segregate independently during meiosis (the cell division that generates haploid gametes for reproduction), so that each gamete carries one of the two alleles with equal probability (0.5) [23]. For example, if both parents have the pair of alleles {B, b}, their child’s genotype takes values in {BB, Bb, bb} with probabilities 0.25, 0.5, and 0.25, respectively. Here, the probability of the child’s genotype being Bb is calculated as follows:

$$\Pr(Bb) = \Pr(B \mid Bb)\,\Pr(b \mid Bb) + \Pr(b \mid Bb)\,\Pr(B \mid Bb) = 0.5 \times 0.5 + 0.5 \times 0.5 = 0.5$$

• Genome-Wide Association Studies (GWAS) Genome-Wide Association Studies explore the associations between genetic information, such as SNPs and genes, and human physical traits and diseases. GWAS reveal that many SNPs are associated with human diseases, such as Alzheimer’s disease, schizophrenia, type-1 diabetes, and inflammatory bowel disease [24].

The NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog) is an open and free database of published SNP-trait associations [9].
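
The Mendelian inheritance example given in this section can be reproduced numerically. The following is a minimal sketch in Python, assuming a single biallelic locus with alleles B and b, no mutation, and genotypes written as unordered two-letter strings; the function name `child_genotype_distribution` is hypothetical and introduced only for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def child_genotype_distribution(father: str, mother: str) -> dict:
    """Distribution of a child's genotype under the law of segregation.

    Each parent passes on one of its two alleles with probability 1/2,
    so each of the four (paternal, maternal) allele pairs has weight 1/4.
    """
    counts = Counter()
    for paternal, maternal in product(father, mother):
        # Genotypes are unordered, so 'bB' and 'Bb' are merged by sorting.
        genotype = "".join(sorted((paternal, maternal)))
        counts[genotype] += Fraction(1, 4)
    return dict(counts)


# Both parents are heterozygous (Bb), as in the example in the text.
dist = child_genotype_distribution("Bb", "Bb")
for genotype, probability in dist.items():
    print(genotype, probability)
```

Exact fractions are used instead of floating-point numbers so that the probabilities can be compared with the values in the text without rounding concerns.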

2.2 Probabilistic Graph Model

2.2.1 Factor Graph

A factor graph is one type of probabilistic graphical model, a family that also includes Bayesian networks and Markov random fields. Both Bayesian networks and Markov random fields can be transformed into factor graphs. A factor graph is a bipartite, undirected graph representing the factorization of a function. It has two types of nodes: variable nodes, one for each variable, and factor nodes, one for each local function describing a relationship between variables. An edge connects a factor node and a variable node if the function depends on the variable. Given a factorization

$$g(X_1, X_2, \ldots, X_n) = \prod_s f_s(X_s)$$

where $X_s \subset \{X_1, X_2, \ldots, X_n\}$, $f_s$ denotes a function of the corresponding set of variables $X_s$. Consider, for example, a joint distribution expressed by the factorization

$$p(X) = g(X_1, X_2, X_3) = f_a(X_1, X_2)\, f_b(X_2, X_3)\, f_c(X_3)$$

from which we obtain the factor graph in Figure 2.1:

Figure 2.1: Example of a factor graph

Factor graphs are useful because they can help to visualize the structure of a problem. They provide direct visualization of dependencies among variables, and also split a complicated function of multiple variables into simple functions. There are efficient and exact inference algorithms that can apply to factor graphs, including the belief propagation algorithm (the sum-product algorithm) and the max-sum algorithm.
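
To make the bipartite structure concrete, the example factorization of this section can be written down as data. The sketch below is illustrative only: it assumes binary variables, and the numeric factor tables as well as the names `factors`, `edges`, and `g` are hypothetical choices that do not come from the thesis.

```python
from itertools import product

variables = ["X1", "X2", "X3"]

# Each factor node stores its scope (the variables it depends on) and a
# local function. The numeric values are made up for illustration.
factors = {
    "fa": (("X1", "X2"), lambda x1, x2: 0.9 if x1 == x2 else 0.1),
    "fb": (("X2", "X3"), lambda x2, x3: 0.8 if x2 == x3 else 0.2),
    "fc": (("X3",), lambda x3: 0.7 if x3 == 1 else 0.3),
}

# Bipartite edges: a factor node is linked to every variable in its scope.
edges = [(name, var) for name, (scope, _) in factors.items() for var in scope]


def g(assignment: dict) -> float:
    """Evaluate the factorized function as the product of local factors."""
    value = 1.0
    for scope, fn in factors.values():
        value *= fn(*(assignment[var] for var in scope))
    return value


# Summing g over all joint assignments gives the normalizing constant.
Z = sum(g(dict(zip(variables, xs))) for xs in product((0, 1), repeat=3))
print(edges)
print(g({"X1": 1, "X2": 1, "X3": 1}), Z)
```

The edge list has one entry per (factor, variable) dependency, which is exactly the information drawn in a factor graph such as Figure 2.1.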

2.2.2 Belief Propagation Algorithm

Belief propagation (also known as sum-product message passing) is an inference algorithm on graphical models, including factor graphs, Bayesian networks, and Markov random fields. It efficiently finds the marginal probability distribution (MPD) of the component variables of a graphical model. To find the MPD of x directly, the joint distribution is summed over all variables except x:

$$p(x) = \sum_{X \setminus x} p(X)$$

where $X \setminus x$ denotes the set of variables in X with x removed. Computed this way, the cost of finding the MPD is exponential in the number of variables. In the context of belief propagation, however, the joint distribution is written as a product over the factor nodes neighbouring x:

$$p(X) = \prod_{s \in ne(x)} F_s(x, X_s)$$

where $ne(x)$ denotes the neighbour nodes (factor nodes) of x, $X_s$ denotes the set of variables that are connected to the variable node x through the factor node $f_s$, and $F_s(x, X_s)$ denotes the product of all the factors in the group associated with factor $f_s$ [25]. The sums can then be pushed inside the product and evaluated locally. The belief propagation algorithm operates in three steps as follows [26]:

Step 1: initialization Send messages from the edges (leaf nodes) of the graph:

• messages from the edge factor nodes to the variable nodes [25]:

$$\mu_{f \to x}(x) = f(x)$$

• messages from the edge variable nodes to the factor nodes [25]:

$$\mu_{x \to f}(x) = 1$$

Step 2: message passing Compute outgoing messages when incoming messages are available:

• messages from variable to factor: product of all incoming messages [25]:

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in ne(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

• messages from factor to variable: product of all incoming messages and factor, sum out previous variables [25]:

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in ne(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

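Step 3: termination The marginal distribution of a variable is the product of the messages arriving at its variable node, $p(x) = \prod_{s \in ne(x)} \mu_{f_s \to x}(x)$. Similarly, the marginal over the set of variables $X_s$ connected to a factor node $f_s$ is the product of the factor and the messages arriving at it:

$$p(X_s) = f_s(X_s) \prod_{i \in ne(f_s)} \mu_{x_i \to f_s}(x_i)$$

In summary, the rule of belief propagation is that the message sent from a node v on edge e is the product of the local function at v with all messages received at v on edges other than e, with every variable except the one associated with e summed out [26]. On a tree-structured factor graph, belief propagation finds the MPD in time linear in the number of variables.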
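
The three steps above can be sketched for a small tree-structured factor graph. This is a minimal illustration in Python and not the implementation used in this thesis: it assumes binary variables, reuses the factorization fa(X1, X2) fb(X2, X3) fc(X3) from Section 2.2.1 with made-up factor values, and all function names are hypothetical. For brevity the messages are recomputed recursively; an implementation that aims for linear cost would cache each message.

```python
from itertools import product
from math import prod

DOMAIN = (0, 1)  # binary variables, an assumption for illustration

# Hypothetical factor tables: scope and local function of each factor node.
FACTORS = {
    "fa": (("X1", "X2"), lambda a, b: 0.9 if a == b else 0.1),
    "fb": (("X2", "X3"), lambda a, b: 0.8 if a == b else 0.2),
    "fc": (("X3",), lambda a: 0.7 if a == 1 else 0.3),
}


def factors_of(var):
    """Factor nodes neighbouring a variable node."""
    return [name for name, (scope, _) in FACTORS.items() if var in scope]


def msg_var_to_factor(var, factor):
    """Product of messages from all other neighbouring factors.

    At a leaf variable there are no other factors, so the message is 1.
    """
    incoming = [msg_factor_to_var(f, var) for f in factors_of(var) if f != factor]
    return {x: prod(m[x] for m in incoming) for x in DOMAIN}


def msg_factor_to_var(factor, var):
    """Multiply the factor by incoming messages and sum out other variables.

    At a leaf factor there are no other variables, so the message is f(x).
    """
    scope, fn = FACTORS[factor]
    others = [v for v in scope if v != var]
    incoming = {v: msg_var_to_factor(v, factor) for v in others}
    out = {x: 0.0 for x in DOMAIN}
    for values in product(DOMAIN, repeat=len(scope)):
        assignment = dict(zip(scope, values))
        weight = fn(*values) * prod(incoming[v][assignment[v]] for v in others)
        out[assignment[var]] += weight
    return out


def marginal(var):
    """Normalized product of all factor-to-variable messages arriving at var."""
    messages = [msg_factor_to_var(f, var) for f in factors_of(var)]
    unnormalized = {x: prod(m[x] for m in messages) for x in DOMAIN}
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}


for name in ("X1", "X2", "X3"):
    print(name, marginal(name))
```

The recursion terminates because the graph is a tree: every chain of message dependencies ends at a leaf node, where the initialization rules of Step 1 apply.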

2.3 Related Work

This section briefly introduces two aspects, namely, inference attacks and protection, relevant to genetic privacy.

2.3.1 Inference Attacks

Genetic privacy breaching frameworks are classified into three categories: identity inference attacks, trait/attribute inference attacks, and completion attacks [13]. However, they are not mutually exclusive. One inference attack can involve two or more types of inference to disclose several pieces of sensitive information about its targets. The inference attacks discussed here focus on techniques that involve data mining, statistical analysis, probabilistic graphical models, and machine learning, rather than on access control and web security.

Identity inference attacks

Identity inference attacks are an active area in privacy risk studies related to genome privacy. They refer to recovering the unique identities of DNA donors from published genomic data, from which the identifying information of the participants has typically been removed. An adversary with a sound background in genetics and data mining is capable of launching an identity attack using quasi-identifiers in a published DNA database. Many techniques for identity inference attacks have been reported, including surname inference [11], DNA profiling and phenotyping [27], demographic-identifier-based attacks [28], pedigree-structure-based attacks [29], and side-channel leaks [30][13]. The work in [11] proposes a technique in which surnames can be re-identified from genomic data by mining short tandem repeats on the Y chromosome from genealogy databases. It also explains how to use other types of auxiliary data from open internet resources along with the surname to recover the identity of the target. A few studies demonstrate that profiling DNA sequencing results based on Mendelian inheritance, SNPs, and short tandem repeat sequences can successfully predict several visible phenotypes/traits such as eye color, hair color, and age, which can be used to obtain the target’s identity information [31][32][27]. Mathias et al. proposed an attack to infer the genomes of relatives from one individual’s genome based on linkage disequilibrium (LD) patterns, factor graphs, and belief propagation [33]. Their work shows not only the potential of factor graphs and belief propagation but also the high level of risk to kin genome privacy.

Attribute (SNPs or traits) inference attack

Probabilistic graphical models can be employed for genetic association studies, such as linkage disequilibrium (LD) pattern classification, genotyping, and GWAS [34]. Schadt et al. [12] infer individual SNP genotypes from RNA profiling data using a Bayesian approach. Zhang et al. [35] provide a method for constructing Bayesian networks from GWAS statistics to capture the conditional dependency between SNPs and traits. Wang et al. [36] discovered the conditional dependency relationships between traits and SNPs using a Bayesian network built from public GWAS Catalog data. Based on the constructed Bayesian network, they revealed the phenotypes and traits of targets from publicly available GWAS statistics (e.g., allele frequencies from the GWAS Catalog).

Completion attacks

Completing the genomic information of a target from partial (incomplete) sequencing data is known as genotype imputation [37]. One study developed an efficient and accurate genotype imputation strategy that imputes missing SNP genotypes into estimated haplotypes based on LD and on genetic information from the 1000 Genomes Project [38]. From an adversary’s perspective, the same strategy can be used to infer hidden or missing sensitive genotypic data. For instance, a study indicated that it is feasible to infer the status of apolipoprotein E (ApoE), which is associated with Alzheimer’s disease, from the released genome of Dr. James Watson, even though this gene and its flanking regions were blanked out before the release [21]. In addition to completion attacks on masked sensitive DNA regions, another study shows that it is possible to infer relatives’ haplotypes from known LD models, genealogical information, and an individual’s genome data using statistical techniques for genotype imputation [39].

2.3.2 Privacy Preserving Genome Data Sharing Methods

There are many countermeasures against privacy issues related to genomic data, such as data anonymization/sanitization, access control [13], and optimization-based solutions [40]. The remainder of this section discusses two active categories of mitigation solutions in the genetic privacy-preserving field: cryptographic solutions and data-sanitization-based solutions.

Cryptography Based Solutions

Cryptography-based solutions allow analysts to operate on the exact genomic data, which preserves the utility and accuracy of the data. One investigation proposes applying homomorphic encryption for the privacy-preserving alignment of DNA sequences [41]. In this work, short alignment maps are stored in encrypted form, while the cryptographic keys are stored at the masking and key manager. The cloud performs the computation without the key; therefore, the plaintext genotype information remains secure. However, the resulting system can handle at most 200 users’ requests, which is not yet suitable for large-scale data sharing. Another work proposes an approach that supports high-throughput privacy-preserving alignment of human genome data on hybrid clouds, which include both a public cloud and a private cloud [42]. Their method is adapted from the seed-and-extend method [43]: the public cloud searches for exact hash matches of seeds, and the private cloud extends the seeds for sequence mapping. However, to date this framework focuses on genome sequencing analysis and has not been extended to privacy-preserving genome data publishing.

The cryptographic approach provides a secure framework that allows organizations to contribute encrypted DNA sequences and side information to clouds or a centralized data repository, where key-holder organizations with the private key can submit queries and obtain decrypted results. Current cryptographic solutions are useful for genomic data mining on encrypted genomic sequences. However, the inherently limited number of key holders restrains the scale of global genomic data sharing. Moreover, a current investigation shows that publicly available GWAS data sets and statistics can still be used by an adversary to infer attributes or traits, even without genome sequences [44].

Data Sanitization Methodologies

k-anonymity Mediated Approach

K-anonymity is a popular method widely used for protecting privacy. Following its principle of “blending in a crowd”, all records are grouped into a limited number of equivalence classes. In this kind of data sanitization method, each record is expected to be indistinguishable from at least k − 1 other records. These k records constitute an equivalence class in which attribute values are generalized. To achieve k-anonymity, attribute values are generalized by replacing them with less specific values, or suppressed by removing outliers [45]. In a few studies, a method named “DNA Lattice Anonymization”, based on the k-anonymity principle, has been used to create generalized DNA sequences that obfuscate an individual’s DNA sequence through clustering [46][47]. However, k-anonymity cannot provide privacy if attackers have background knowledge or if sensitive values lack diversity [48]. Moreover, there is no accurate and useful method to quantitatively evaluate and validate the level of privacy protection based on k-anonymity.
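
As a small illustration of the k-anonymity condition and of generalization, the following Python sketch checks whether every combination of quasi-identifier values occurs at least k times. The toy records, the choice of age and ZIP code as quasi-identifiers, and the function names `is_k_anonymous` and `generalize_age` are hypothetical; this is not the DNA Lattice Anonymization method of [46][47].

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if each combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Generalization: replace an exact age by a coarser interval."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Toy records, made up for illustration.
records = [
    {"age": 31, "zip": "45056", "genotype": "BB"},
    {"age": 34, "zip": "45056", "genotype": "Bb"},
    {"age": 38, "zip": "45056", "genotype": "bb"},
]

print(is_k_anonymous(records, ["age", "zip"], k=3))
generalized = [generalize_age(r) for r in records]
print(is_k_anonymous(generalized, ["age", "zip"], k=3))
```

Before generalization each record forms its own group; after the ages are coarsened to a ten-year interval, the three records fall into a single equivalence class of size three.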

Differential Privacy Based Methods

Differential privacy is an emerging robust framework for designing privacy-preserving algorithms for data publishing [49]. It operates by injecting noise or randomness into the released data [50]. The goal of differential privacy is to maximize the accuracy of queries on a database while minimizing the privacy impact on the individuals whose information is in the database. Differential privacy aims to guarantee that no single individual’s attributes can significantly affect the output of the released statistical database [51]. In other words, the released data are statistically indistinguishable from the release that would be produced if a single individual’s record were dropped. Therefore, an adversary, even with arbitrary prior knowledge, cannot use the query output to infer the inputs accurately. Formally, a randomized data release algorithm A satisfies ε-differential privacy if, for any two neighboring data sets D and D', and for any possible synthetic data set D*,

$$\Pr[A(D) = D^*] \le \exp(\epsilon)\, \Pr[A(D') = D^*]$$

where D is the original data set and D' is the data set with any single individual’s record deleted [50]. The privacy parameter ε quantifies the difference between the output distributions on neighboring data sets and hence the level of privacy breach.

However, applying differential privacy to genomic data protection is a challenge. Since genomic data typically have a large number of dimensions, current techniques require adding a massive amount of noise to the data in order to guarantee differential privacy, and a large amount of noise may compromise data utility.

Several research projects have explored the application of differential privacy to genomic data release. Caroline Uhler et al. developed a technique that allows the release of summary statistics, such as average minor allele frequencies, chi-square statistics, and p-values, for the SNPs that are most relevant to diseases without compromising the privacy of those in the data set [52]. Aaron Johnson and Vitaly Shmatikov proposed differentially private data mining algorithms for the release of common GWAS data [44]. Their techniques support computing the number and locations of SNPs that are associated with diseases without requiring the analyst to know in advance which SNPs or statistical tests to consider. Yu et al. applied differential privacy to real human GWAS data collected by the Wellcome Trust Case Control Consortium [53]. Their methods rely on the Laplace mechanism and the exponential mechanism [54]. To deal with the high dimensionality of genomic data, Wang et al. developed a novel approach based on top-down specialization that satisfies differential privacy [55]. Although the above studies provide algorithms that ensure differential privacy, the accuracy and utility of public GWAS data are compromised by non-trivial amounts of noise.
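
The noise-injection idea behind differential privacy can be illustrated with the Laplace mechanism mentioned above, applied to a single counting query. The sketch below is a generic textbook-style example in Python and not the sanitization method developed in this thesis. It assumes that deleting one individual changes the count by at most 1 (sensitivity 1); the function names and the example count of 120 carriers are hypothetical. Because the output is continuous, the ε bound is checked on probability densities rather than on probabilities of exact outputs.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    sensitivity = 1.0  # assumption: one record changes the count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def laplace_density(x: float, mean: float, scale: float) -> float:
    """Density of the Laplace distribution centred at `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
epsilon = 0.5
noisy = private_count(120, epsilon, rng)  # hypothetical count of carriers
print(noisy)

# For neighbouring counts 120 and 119, the density ratio of any output
# stays within exp(epsilon).
for output in (100.0, 119.5, 120.0, 130.0):
    ratio = laplace_density(output, 120, 1 / epsilon) / laplace_density(output, 119, 1 / epsilon)
    print(output, ratio <= math.exp(epsilon) + 1e-9)
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is the utility trade-off discussed in this section.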

2.4 Comparison

Modeling the probabilistic dependency between SNPs and attributes is helpful to discover the association between SNPs, traits, and family members, thus promoting the progress of genetic diagnostics and personalized medicines.

Mimicking a powerful adversary attacking genomic data is an important step in investigating how to preserve privacy. Current inference attack studies show the potential of employing probabilistic models, data mining, and machine learning to disclose sensitive information from public genetic databases combined with other auxiliary information. In particular, many projects used a Bayesian network model to infer targets' traits and identity. Nevertheless, the directed edges of a Bayesian network constrain the dependency relationships that can be expressed among the large numbers of variables in genome information, which restricts the types of predicates or targets that an attacker can select. As an undirected model, a factor graph does not have such a limitation. Another problem with Bayesian networks is that the prior distribution held by the adversary has to be determined. Moreover, the belief propagation algorithm on the factor graph allows us to reduce the computational complexity of computing the marginal distributions of component variables from exponential to linear in the number of variables. Therefore, we use a factor graph and belief propagation for the simulation of the inference attack. Another limitation of most current inference strategies is that they do not consider the kin-genomic dependency relationship. Our work takes this relationship into consideration.

Cryptography-based solutions for preventing genetic privacy breaches have a distinct advantage: they preserve the accuracy of the genome data, which keeps the data usable. However, current cryptographic methods are limited to data operations within a predetermined cloud or data controller and do not facilitate global data publishing. Their application scope is also narrow. To date, research implementing cryptography-based solutions has focused on privacy-preserving genome data mining, including sequence alignment and mapping.
On the other hand, several studies have used k-anonymity to protect personal genetic information. The strengths of this method are its low computational complexity and relatively low utility loss. However, k-anonymity cannot quantify the level of privacy and requires assumptions about the attacker's background knowledge. By contrast, differential privacy based solutions are independent of the side information that adversaries hold, and they provide a quantitative measure of the privacy level. However, applying this method to genetic privacy comes at a severe cost in data utility due to the nature of genome data, such as its high dimensionality. As a result, the released output contains non-trivial amounts of random noise [54].

Chapter 3 Inference Attacks On Kin Genomic Data and Their Mitigation

3.1 Introduction

With the advancement of biotechnology and genome sequencing over the past two decades, the cost of sequencing an individual's whole genome has dropped from millions of dollars to about one thousand dollars. The amount of genome data generated by research institutes, organizations, and personal testing services is surging. There has been unprecedented growth in the genome data available in public databases such as GenBank at NCBI (National Center for Biotechnology Information) and the European Nucleotide Archive at EMBL-EBI (European Bioinformatics Institute), and in other online sources (e.g., OpenSNP.org and 23andMe.com). With the decoding of the genome sequence, it has been well recognized that an individual's complete genome sequence serves as a blueprint for building a human being. In other words, it carries enormous information about each person. This information, in turn, makes genome data useful in various areas, including but not limited to personalized medicine, prediction and diagnosis of genetic diseases, genetic fingerprinting, DNA forensics, genetic ancestry, and paternity testing [56]. The sharing of genome data among biologists, data scientists, physicians, and others across the world contributes significantly to the advancement of biotechnology and medicine, thus bringing a valuable positive impact to human life and public health.

Nevertheless, genome data also contains sensitive information. Releasing or sharing this information can pose privacy and ethical threats to individuals. For example, a powerful attacker can learn people's disease susceptibility, ancestry, and identity from genomic data. Once the owner of a genome is revealed, he or she faces the risk of discrimination in job markets and insurance. This raises concerns among people who share genome data and among participants in genome-wide association studies, impeding the use of genome data for personalized medicine and other biomedical fields.
In response to these issues, genetic privacy protection laws such as the Genetic Information Nondiscrimination Act and the Health Insurance Portability and Accountability Act have been enacted. Moreover, conventional data sanitization methods, such as anonymization and the removal of some sensitive DNA information from databases, have been put into practice. However, current data sanitization methods fail to prevent inference attacks on genomic data, given the increasing availability of genomic data and its linkage to metadata across the internet [57]. Besides, the genomes of biologically related family members are highly correlated. Therefore, even if an individual's genome is not available online at all, exposure of his or her family members' genome data can leak his or her sensitive information.

In this thesis, we propose a framework for inference attacks on kin genome data. In particular, we assume an adversary who has background knowledge in genetics, the genome data of some family members, and genome-wide association study statistics. This adversary aims to predict the SNPs (significant components of genome variation) and traits (phenotypes and disease risks) of target individuals from a family. A better understanding of inference attacks from the standpoint of attackers is essential for quantifying kin genomic privacy. Moreover, it is a prerequisite for developing a privacy-preserving algorithm that achieves an appropriate balance between the utility and the risk of genome data publishing. Furthermore, we propose a sanitization method that satisfies a differential privacy budget.

3.2 Preliminaries

3.2.1 Genetics Inheritance

While human beings share about 99% of their genomic information, the variations among populations mostly come from SNPs (single-nucleotide polymorphisms), as described in Chapter 2. According to the Mendelian scheme of inheritance (also described in Chapter 2), we can obtain the probability distribution of a child's genotype for a SNP given his or her parents' genotypes (Table 3.1) and the probability distribution of a parent's genotype for a SNP given the genotypes of his or her spouse and child (Table 3.2). The three numbers within parentheses in each table entry are the probabilities that the SNP takes the values (BB, Bb, bb). The dependencies of SNPs within a trio (father, mother, child) can also be modeled by a Bayesian network, as shown in Figure 3.1, or by a factor graph, as shown in Figure 3.3.

Father / Mother    BB = 0           Bb = 1             bb = 2
BB = 0             (1, 0, 0)        (1/2, 1/2, 0)      (0, 1, 0)
Bb = 1             (1/2, 1/2, 0)    (1/4, 1/2, 1/4)    (0, 1/2, 1/2)
bb = 2             (0, 1, 0)        (0, 1/2, 1/2)      (0, 0, 1)

Table 3.1: Probability of a child’s genotypes, given parents’ genotypes
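The entries of Table 3.1 follow from each parent transmitting one of their two alleles with equal probability. As a minimal sketch (the function name `child_genotype_distribution` and the encoding of genotypes by their number of b alleles are illustrative assumptions, not the thesis implementation), the table can be derived as follows:

```python
from fractions import Fraction
from itertools import product

# Genotypes are encoded by their number of b alleles: BB = 0, Bb = 1, bb = 2.
ALLELES = {0: ("B", "B"), 1: ("B", "b"), 2: ("b", "b")}


def child_genotype_distribution(father, mother):
    """Return (P(BB), P(Bb), P(bb)) for a child, given the parents' genotypes.

    Assumes plain Mendelian inheritance: each parent passes on one of their
    two alleles with probability 1/2, independently of the other parent.
    """
    dist = [Fraction(0)] * 3
    for a, b in product(ALLELES[father], ALLELES[mother]):
        n_b = (a == "b") + (b == "b")  # number of b alleles the child inherits
        dist[n_b] += Fraction(1, 4)    # each allele pair has probability 1/4
    return tuple(dist)


if __name__ == "__main__":
    # Print all nine parent combinations for comparison with Table 3.1.
    for f, m in product(range(3), repeat=2):
        print(f, m, child_genotype_distribution(f, m))
```

Exact fractions are used here so that the output can be compared with the table entries without rounding.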

SNP/Trait Association Statistics

An association between a SNP and a trait does not imply causation. In many cases, a particular allele of a SNP is significantly associated with a disease because of its strong association with other factors that cause the disease.

Mother / Child     BB = 0           Bb = 1             bb = 2
BB = 0             (1/2, 1/2, 0)    (1/2, 1/2, 0)      (0, 0, 0)
Bb = 1             (1/2, 1/2, 0)    (1/3, 1/3, 1/3)    (0, 1/2, 1/2)
bb = 2             (0, 0, 0)        (0, 1/2, 1/2)      (0, 1/2, 1/2)

Table 3.2: Probability of a father’s genotypes, given his spouse and his child’s genotypes

Figure 3.1: Genotype relationships among trio family

The odds ratio (OR), in the context of genome-wide association studies (GWAS), quantifies the strength of the association between a SNP and a phenotype. Phenotypes include traits such as eye color and blood type, Mendelian diseases, and complex disorders such as heart disease, diabetes, and cancers. The OR is defined as the ratio of the odds of carrying a specific allele (nucleotide N) of a SNP among individuals with trait K to the odds of carrying N among individuals without trait K. An OR equal to 1 indicates that there is no association between the SNP and the trait. An OR greater than 1 implies that the corresponding allele of the SNP is associated with the trait; the larger the OR, the stronger the association. By contrast, an OR less than 1 shows that the allele and the trait are negatively correlated. For SNP $i$ with risk allele $b$ (and non-risk allele $B$), we can obtain the odds ratio $O_b$ for trait $j$ and the risk allele frequency (RAF) in the control group, $f_{bc}$, from the GWAS catalog. We can then calculate the risk allele frequency in the case group, $f_{be}$, according to the following equation [36]:

\[ f_{be} = \frac{O_b \times f_{bc}}{O_b \times f_{bc} + 1 - f_{bc}} \tag{3.1} \]
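As a worked illustration of Equation 3.1, the following sketch converts an odds ratio and a control-group risk allele frequency into the case-group frequency. The function name and the numeric inputs are hypothetical and chosen only for illustration; they are not values from the GWAS catalog.

```python
def case_risk_allele_frequency(odds_ratio, f_control):
    """Equation 3.1: risk allele frequency in the case group.

    odds_ratio -- allelic odds ratio O_b of the risk allele b (must be > 0)
    f_control  -- risk allele frequency f_bc in the control group, in [0, 1]
    """
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    if not 0.0 <= f_control <= 1.0:
        raise ValueError("f_control must be a frequency in [0, 1]")
    return odds_ratio * f_control / (odds_ratio * f_control + 1.0 - f_control)


if __name__ == "__main__":
    # Hypothetical inputs: O_b = 2.0 and f_bc = 0.2.
    f_case = case_risk_allele_frequency(odds_ratio=2.0, f_control=0.2)
    print(round(f_case, 4))  # 0.3333
```

With these inputs, the odds in the control group are 0.2 / 0.8 = 0.25 and the odds in the case group are (1/3) / (2/3) = 0.5, so their ratio recovers the odds ratio of 2.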

With the RAF in the case group, $f_{be}$, and the RAF in the control group, $f_{bc}$, the probability distribution of genotypes given the associated trait $j$ can be determined as shown in Table 3.3 [58].

However, it is not appropriate to apply the OR of a risk allele (usually the minor allele) of a SNP directly to an individual. We need to convert it to a ratio against the average population. Thus, the prevalence of the trait, as a prior, can be used in combination with the OR to determine the probability of disease given an individual's genotype (SNP information). If the disease is rare, with a prevalence of less than 0.1%, then the odds ratio is approximately equal to the genotype relative risk.

        j                         Not j
BB      $(1 - f_{be})^2$          $(1 - f_{bc})^2$
Bb      $f_{be}(1 - f_{be})$      $f_{be}(1 - f_{bc})$
bb      $f_{be}^2$                $f_{bc}^2$

Table 3.3: Conditional probability distribution of phenotypes

3.2.2 Differential Privacy

Differential privacy (DP) is a powerful tool for protecting data privacy. It addresses the risk of individual privacy breaches as more data becomes available, while still allowing for data sharing and utility. The strength of DP is bounded by a privacy budget, denoted by $\epsilon$. An algorithm $A$ satisfies $\epsilon$-differential privacy if, for any neighboring databases $D$ and $D'$ and any set of outputs $S$ [59],

\[ \frac{\Pr(A(D) \in S)}{\Pr(A(D') \in S)} \le e^{\epsilon} \tag{3.2} \]

where $D$ and $D'$ are neighboring databases, meaning that they differ in only one tuple. A smaller $\epsilon$ implies more robust protection: as $\epsilon$ decreases, it becomes harder for an attacker to determine whether the information of an individual of interest is in the data set.

Sensitivity. For any query function

\[ f : D \to \mathbb{R}^d \]

where $D$ is a data set and $\mathbb{R}^d$ is the space of $d$-dimensional real-valued vectors, the sensitivity is defined as

\[ \Delta f = \max \lVert f(D) - f(D') \rVert_1 \tag{3.3} \]

where $D$ and $D'$ are neighboring data sets and $\lVert \cdot \rVert_1$ is the $\ell_1$ norm. If the maximum in Equation 3.3 is taken over all neighboring pairs $D, D'$, it gives the global sensitivity. If the maximum is taken only over the neighbors $D'$ of a fixed $D$, it gives the local sensitivity. The larger the sensitivity, the larger the amount of noise that must be added, resulting in higher privacy but lower utility. In short, the DP noise is controlled by both $\epsilon$ and the sensitivity $\Delta f$ of the query function. The Laplace mechanism and the exponential mechanism are two widely used methods for achieving DP [59].

Laplace Mechanism

The Laplace mechanism, in the context of differential privacy, refers to the addition of Laplace-distributed random noise (Laplace noise) to data. It is a standard method in the field of differential privacy for protecting individual privacy. As its name indicates, Laplace noise is drawn from the Laplace distribution, whose probability density function is:

\[ f(x \mid \mu, b) = \frac{1}{2b} \, e^{-\frac{|x - \mu|}{b}} \tag{3.4} \]

where $\mu$ is the mean and $2b^2$ is the variance. For any function $f : \mathbb{N}^{|x|} \to \mathbb{R}^k$, consider an algorithm $M$ that adds noise $\eta$ to the true output $f(D)$:

\[ M(D, f(\cdot), \epsilon) = f(D) + \eta = f(D) + \mathrm{Lap}(\Delta f / \epsilon) \tag{3.5} \]

where $\eta$ is drawn from a Laplace distribution with location parameter (mean) $\mu = 0$ and scale $\Delta f / \epsilon$. Then this algorithm $M$ guarantees $\epsilon$-differential privacy [50].
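The Laplace mechanism of Equation 3.5 can be sketched for a scalar query as follows. This is a minimal illustration using only the Python standard library; the function names are assumptions, and the noise is generated from the known fact that the difference of two independent Exponential(1) variables follows a standard Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    Uses the difference of two independent Exponential(1) variables,
    which follows a standard Laplace distribution, scaled by `scale`.
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value + Lap(sensitivity / epsilon), as in Equation 3.5."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    rng = rng or random.Random()
    return true_value + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical counting query (e.g. number of bb carriers), sensitivity 1.
    rng = random.Random(42)
    print(laplace_mechanism(true_value=120, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, which matches the privacy/utility trade-off described above.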

Exponential Mechanism

The Laplace mechanism is widely used for data privacy but only applies to functions that produce numerical values. When the output of a function $F$ is categorical, the exponential mechanism can be applied [60]. The principle of the exponential mechanism is to create a differentially private version of $F$ by sampling from the output domain $\Omega$ of $F$. Given a data set $D$, the exponential mechanism samples $\omega \in \Omega$ with a probability proportional to

\[ e^{\frac{\epsilon \, u(D, \omega)}{2 \Delta u}} \tag{3.6} \]

which satisfies $\epsilon$-differential privacy [60]. In Equation 3.6, $u$ denotes a user-specified score function, which measures the quality of $\omega$, and $\Delta u$ is the sensitivity of the utility score $u$:

\[ \Delta u \equiv \max_{\omega \in \Omega} \; \max_{D, D' : \lVert D - D' \rVert_1 \le 1} \lVert u(D, \omega) - u(D', \omega) \rVert \tag{3.7} \]

3.2.3 Adversary Model

The goal of an adversary is to infer sensitive unknown SNPs or traits of a target individual. We assume the adversary has a solid background in genetics and data mining. From publicly available online data, the adversary can extract the following information: the genomic data of one or more biological family members of the target individual; family trees; SNP-trait association statistics from GWAS reports; and the minor allele frequency (MAF) of each SNP (the frequency of the second most common allele of the SNP in a given population).

3.2.4 Problem Definition

This research problem can be briefly described as follows.

Input:

• Genomic data of some family members, presented in the format of SNPs

• Family tree representing the family relationships

• Minor allele frequency (MAF) of SNPs

• SNP/trait association statistics

Output:

• Prediction of unknown SNP or traits/phenotypes of target family members

• Algorithms for inferring SNP or traits

• Methods for sanitizing kin genome data

3.3 Method

In this section, we propose an approach for constructing a comprehensive factor graph that integrates the partial genomic data of family members, GWAS statistics, and family relationships. We also present a framework that performs belief propagation on this factor graph to predict unknown SNPs or traits, together with evaluation metrics for quantifying kin genomic privacy loss. An overview of the inference attack is shown in Figure 3.2. Lastly, we introduce a method for sanitizing genomic data with differential privacy based on the exponential mechanism.

Figure 3.2: Workflow for kin Genomic data inference attack

3.3.1 Notations

Let us assume that a target family $F$ with $m$ members ($|F| = m$) releases the partial genome data of its members and its family tree to the public. The set of SNPs of the individuals in the target family is denoted by $S$, where each SNP $s_i \in S$ takes a genotype value in $\{BB, Bb, bb\}$. For simplicity, the genotype labels BB, Bb, and bb are encoded as "0", "1", and "2", respectively, in our data files. This is consistent with their indices in the lists and arrays used in our experiments. The known SNPs of individuals in the target family are denoted by $S_K$, whereas $S_U$ denotes the unknown SNPs.

3.3.2 Construction of Kin Genome Data Model

Probabilistic graphical models are commonly used to represent conditional dependence and probability distributions. A factor graph is a type of probabilistic graphical model that is bipartite and undirected. It contains two kinds of nodes: factor nodes and variable nodes. To construct the kin genomic factor graph, we define two types of variable nodes, SNP variable nodes and trait variable nodes, and two types of factor nodes, familial factor nodes and SNP/trait association factor nodes. A familial factor node integrates the familial relationship and the Mendelian inheritance dependence of a SNP among family members, as shown in Table 3.1.

Figure 3.3 shows a simple example of a kin genomic factor graph representing a trio (father, mother, and one child). In Figure 3.3, there are 3 SNP variable nodes $S^i = \{s^i_1, s^i_2, s^i_3\}$ and 2 trait variable nodes $T^i = \{t^i_1, t^i_2\}$ for each family member $i$; the SNP/trait factor nodes $g^i$ represent that $\{s^i_1, s^i_2\}$ and $\{s^i_2, s^i_3\}$ are associated with $t^i_1$ and $t^i_2$, respectively.

Figure 3.3: Example of kin genome factor graph, representing a trio with 3 SNP and 2 traits per family member.
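The structure of such a trio factor graph can be sketched with plain dictionaries. In the sketch below, the member numbering (1 = father, 2 = mother, 3 = child), the node naming scheme, and the choice to create familial factors only for the child are assumptions made for illustration; the thesis implementation may organize the graph differently.

```python
from collections import defaultdict


def build_trio_factor_graph(n_snps=3, trait_links=None):
    """Return a dict mapping each factor node to the variable nodes it touches.

    Hypothetical naming: "s^i_j" is SNP j of member i, "t^i_k" is trait k of
    member i, "f^3_j" is the child's familial factor for SNP j, and
    "g^i_jk" links SNP j and trait k of member i.
    """
    if trait_links is None:
        # trait k -> SNPs associated with it (t1: s1, s2; t2: s2, s3)
        trait_links = {1: (1, 2), 2: (2, 3)}
    members = (1, 2, 3)
    factors = {}
    # One familial factor per SNP, tying the child's SNP to both parents' SNPs.
    for j in range(1, n_snps + 1):
        factors[f"f^3_{j}"] = tuple(f"s^{i}_{j}" for i in members)
    # One SNP/trait association factor per (member, SNP, trait) link.
    for i in members:
        for k, snps in trait_links.items():
            for j in snps:
                factors[f"g^{i}_{j}{k}"] = (f"s^{i}_{j}", f"t^{i}_{k}")
    return factors


def neighbors(factors):
    """Invert the factor -> variables map into variable -> set of factors."""
    ne = defaultdict(set)
    for factor, variables in factors.items():
        for v in variables:
            ne[v].add(factor)
    return dict(ne)


if __name__ == "__main__":
    graph = build_trio_factor_graph()
    print(sorted(neighbors(graph)["s^3_1"]))  # ['f^3_1', 'g^3_11']
```

Keeping the graph as a factor-to-variables map makes the bipartite structure explicit: every edge connects exactly one factor node and one variable node.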

3.3.3 Applying Belief Propagation

One of our research problems is inferring unknown sensitive attributes (traits and SNPs) given some SNPs and traits, GWAS statistics, and biological relationships. The inference attack can therefore be formulated as calculating the marginal probability distribution (MPD) of the unknown target variables, given the observed SNPs, traits, familial relationships, and GWAS SNP/trait association statistics. The marginal probability distribution of a specific target variable $x_i$ can be determined from the joint probability distribution of the unknown variables $X_U$:

\[ p(x_i \mid S_K, T_K, F, \Lambda) = \sum_{X_U \setminus x_i} p(X_U \mid S_K, T_K, F, \Lambda) \tag{3.8} \]

where $X_U = S_U \cup T_U$ represents the unknown target SNPs and traits; $X_U \setminus x_i$ represents all unknown variables except $x_i$; $S_K$ is the set of known SNPs; $T_K$ is the set of known traits; $F$ indicates the kin-genome relationship; and $\Lambda$ represents the SNP-trait association and the side information from public GWAS statistics.

According to Equation 3.8, if the marginal probability distribution is calculated directly, the running time grows exponentially with the number of variables in $X_U$ [61]. Since genome data is a kind of Big Data, with around 5 million SNPs per human genome, it is infeasible to include this many SNPs in a direct calculation. To make the computation feasible and efficient, we use belief propagation on the factor graph.

Belief propagation, also known as the sum-product algorithm, is a powerful method for inference on probabilistic graphical models such as Bayesian networks and factor graphs. It exploits the factorization of the joint probability distribution into local functions, each of which takes a subset of the variables as arguments, and thereby reduces the computation time from exponential to linear complexity. Accordingly, the probability distribution $p(X_U \mid S_K, T_K, F, \Lambda)$ can be factorized into a product of local functions, each of which takes a limited number of neighboring SNPs and traits as arguments:

\[ p(X_U \mid S_K, T_K, F, \Lambda) = \frac{1}{Z} \left[ \prod_{i \in F} \prod_{j \in S} f^i_j(s^i_j, s^F_j, s^M_j) \right] \times \left[ \prod_{i \in F} \prod_{j \in S} \prod_{k \in T} g^i_{jk}(s^i_j, t^i_k) \right] \tag{3.9} \]

where $\frac{1}{Z}$ is a normalization coefficient, and $s^F_j$ and $s^M_j$ denote SNP $j$ of the father and the mother of member $i$. Figure 3.4 shows the conversion from calculating the MPD directly, with exponential complexity, to calculating it with linear complexity.
In belief propagation, messages are passed iteratively between variable nodes and factor nodes.

Figure 3.4: Belief propagation on factor graph

The way messages propagate from a variable node to a factor node differs from the way they propagate from a factor node to a variable node. We denote by $\mu$ the messages from a variable node to a factor node, and by $\lambda$ the messages from a factor node to a variable node. A message $\mu$ sent by a variable node $x$ to a neighboring factor node $f$ is the product of all messages that $x$ receives from its connected factor nodes except $f$. By contrast, a message $\lambda$ from a factor node $f$ to a variable node $x$ is obtained by multiplying the factor represented by $f$ with all the other incoming messages to $f$ and then summing out all the variables associated with those incoming messages.

For clarity, we use the simple factor graph in Figure 3.3 as an example to illustrate message passing with belief propagation on the kin genomic factor graph. The message of a SNP variable node $s$ is its belief, that is, the probabilities that this SNP takes each value in the domain $\{BB, Bb, bb\}$. The message $\mu^{(n)}_{s \to f}(s^{1(n)}_1)$ from $s^1_1$ to the factor node $f^3_1$ in the $n$-th iteration is given by:

\[ \mu^{(n)}_{s \to f}(s^{1(n)}_1) = \frac{1}{Z} \prod_{f \in ne(s^1_1) \setminus f^3_1} \lambda^{(n-1)}_{f \to s}(s^{1(n-1)}_1) \tag{3.10} \]

where $\frac{1}{Z}$ is the normalization coefficient and $ne(s^1_1) \setminus f^3_1$ represents all neighbors of node $s^1_1$ except node $f^3_1$. For this example, $ne(s^1_1) \setminus f^3_1 = \{f^1_1, g^1_{11}\}$.

The message $\lambda^{(n)}_{f \to s}(s^3_1)$ propagated from the familial factor node $f^3_1$ to its neighboring SNP variable node $s^3_1$ at the $n$-th iteration is given by:

\[ \lambda^{(n)}_{f \to s}(s^{3(n)}_1) = \sum_{s^1_1, s^2_1} f^3_1(s^1_1, s^2_1, s^3_1) \prod_{m \in ne(f^3_1) \setminus s^3_1} \mu_{m \to f}(m) \tag{3.11} \]

Note that $f^3_1(s^1_1, s^2_1, s^3_1) \propto p(s^3_1 \mid s^1_1, s^2_1)$, which characterizes Mendelian inheritance and is calculated from Table 3.1.

The message passed from the SNP variable node $s^3_1$ to the SNP/trait association factor node $g^3_{11}$ at the $n$-th iteration is given by:

\[ \mu^{(n)}_{s \to g}(s^{3(n)}_1) = \frac{1}{Z} \prod_{w \in ne(s^3_1) \setminus g^3_{11}} \lambda^{(n-1)}_{w \to s}(s^{3(n-1)}_1) \tag{3.12} \]

where $\frac{1}{Z}$ is a normalization coefficient and $ne(s^3_1) \setminus g^3_{11}$ represents the set of neighbors of $s^3_1$ except $g^3_{11}$; for the case in Figure 3.3, $ne(s^3_1) \setminus g^3_{11} = \{f^3_1\}$.

The message passed from the SNP/trait association factor node $g^3_{11}$ to the trait variable node $t^3_1$ at the $n$-th iteration is:

\[ \lambda^{(n)}_{g \to t}(t^{3(n)}_1) = \sum_{s^3_1} g^3_{11}(s^3_1, t^3_1) \prod_{k \in ne(g^3_{11}) \setminus t^3_1} \mu_{k \to g}(k) \tag{3.13} \]

Note that $g^3_{11}(s^3_1, t^3_1) \propto p(s^3_1 \mid t^3_1)$, which can be computed according to Table 3.3 and GWAS statistics.

For initialization, belief propagation on a kin genomic factor graph begins at the variable nodes. For unknown variable nodes, which include both SNP nodes and trait nodes ($x = s \cup t$), we set uniform messages: $\mu^{(1)}_{x \to f}(x^{(1)}) = 1$ for each possible value of $x$. For observed variable nodes, if $x = v$ is given, we set $\mu^{(1)}_{x \to f}(x^{(1)} = v) = 1$ and $\mu^{(1)}_{x \to f}(x^{(1)} = v') = 0$ for all $v' \ne v$. The algorithm stops when all the messages have converged. The marginal probability of each unknown variable is then the normalized product of all incoming messages to it.
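The message-passing rules above can be illustrated on the smallest possible case: one SNP in a trio, with a single familial factor. The sketch below is a simplified illustration, not the thesis implementation (which uses pgmpy); all function names are assumptions. It computes the factor-to-variable message of Equation 3.11 toward the child when the mother is observed and the father is unobserved, using the uniform initialization described in the text. Because this graph has a single factor, one pass already yields the exact marginal.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # BB = 0, Bb = 1, bb = 2 (number of b alleles)


def mendelian_factor(father, mother, child):
    """f(s^1, s^2, s^3) = p(child | father, mother) under Mendelian inheritance."""
    p_b_father = father / 2.0  # probability that the father transmits allele b
    p_b_mother = mother / 2.0  # probability that the mother transmits allele b
    probs = (
        (1 - p_b_father) * (1 - p_b_mother),
        p_b_father * (1 - p_b_mother) + (1 - p_b_father) * p_b_mother,
        p_b_father * p_b_mother,
    )
    return probs[child]


def initial_message(observed=None):
    """Variable-to-factor message: indicator if observed, uniform otherwise."""
    if observed is None:
        return [1.0, 1.0, 1.0]
    return [1.0 if g == observed else 0.0 for g in GENOTYPES]


def factor_to_child_message(mu_father, mu_mother):
    """Equation 3.11: sum out both parents, weighted by their incoming messages."""
    lam = [0.0, 0.0, 0.0]
    for f, m, c in product(GENOTYPES, repeat=3):
        lam[c] += mendelian_factor(f, m, c) * mu_father[f] * mu_mother[m]
    return lam


def normalize(message):
    """Scale a message so that its entries sum to 1."""
    total = sum(message)
    return [v / total for v in message]


if __name__ == "__main__":
    # Mother observed as Bb (1); father unobserved, so his message is uniform.
    belief = normalize(factor_to_child_message(initial_message(), initial_message(1)))
    print(belief)  # [0.25, 0.5, 0.25]
```

In a full kin genomic factor graph, the child's belief would be the normalized product of this message with the messages arriving from its SNP/trait association factors.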

3.4 Sanitization of Kin Genomic Data

One objective of our research is to develop a data sanitization approach that achieves a balance between data utility and privacy. Differential privacy has been shown to be a robust framework for ensuring the privacy of released data [51]. Therefore, we aim to establish a privacy-preserving kin genome data publishing approach with a differential privacy guarantee. However, because genome data is high-dimensional Big Data, the amount of noise that must be added is enormous, and injecting noise directly into the data is an overwhelming task for data scientists. When DP noise is added to all tuples, as conventional methods do, the signal-to-noise ratio is significantly reduced, and the massive amount of noise in turn threatens data utility.

To address this challenge, we propose a data sanitization strategy that achieves a trade-off between data utility and privacy. Figure 3.5 shows an example of data sanitization that satisfies differential privacy. The first step is to approximate the original genomic data by finding a set of low-dimensional distributions that represent the full distribution of the original familial genome data set. These low-dimensional distributions are then sanitized by injecting differential privacy noise. We use the factor graph and the belief propagation algorithm described in Sections 3.3.2 and 3.3.3 to factorize the high-dimensional global joint distribution into low-dimensional local functions, each of which takes a subset of the variables as arguments. Most SNPs are neither associated with traits nor neighbors of a sensitive target $x_i$, which can be determined from the local functions and the arguments they take, so we preserve them as they are for data utility. A sensitive target $x_i$ is one assessed as having "clinical significance", for example labeled "likely benign" or "likely pathogenic" in the NCBI SNP and GWAS databases [62]. Therefore, the SNPs that are sensitive and are neighbors of a sensitive target $x_i$ will be sanitized.

Figure 3.5: Example of addition of differential privacy noise

In this work, the neighbors of a sensitive target $x_i$ are defined as the variables that:

• are connected with $x_i$ directly through one SNP-trait association factor node whose local function takes both as arguments

• are connected with $x_i$ directly through one familial factor node

In this work, two strategies are proposed for data sanitization. The first is the addition of differentially private noise to the neighbors of a sensitive target $x_i$. The second is the removal of neighboring sensitive SNPs whose sanitization prevents the prediction of $x_i$. SNP removal is similar to the setting of the inference attack, in which some SNPs are hidden. Thus, the rest of this section focuses on DP noise injection based on the exponential mechanism.

Kin genome data sets contain the genotypes of individuals in families, which are not numerical. Thus, to satisfy $\epsilon$-differential privacy, we choose the exponential mechanism over the Laplace mechanism for data sanitization. Note that, unlike conventional methods, which add differentially private noise to all tuples, we only sanitize the neighbors of sensitive SNPs or traits, avoiding a significant loss of data utility.

To apply the exponential mechanism, we define the group genotype probability distribution as our score function $\mu$, which plays the role of $u$ in Equation 3.6. Compared with outputting uniformly randomized genotypes, using this score function lets us sample outputs that are consistent with the group genotype probability distribution. Therefore, adding or removing any tuple does not have a noticeable effect on aggregate statistics, which preserves both the aggregate statistics and an individual's genetic privacy.

Given the minor allele frequency (MAF) for a population, we can obtain the genotype probability distribution for this population according to the Hardy-Weinberg equilibrium. For simplicity, we assume that only the two most common alleles exist for a SNP: the dominant allele $B$ with frequency $p$ and the minor allele $b$ with frequency $q$. Thus we have:

\[ p + q = 1 \tag{3.14} \]

\[ (p + q)^2 = p^2 + 2pq + q^2 = 1 \tag{3.15} \]

Then, we get the expected genotype frequencies, as shown in Table 3.4:

Name                    Genotype    Frequency
Dominant homozygous     BB = 0      $p^2$
Heterozygous            Bb = 1      $2pq$
Recessive homozygous    bb = 2      $q^2$

Table 3.4: Probability of genotypes distribution, given allele frequency

The genome database $D$ records, for each SNP $i$, the genotypes (BB, Bb, or bb) of the individuals. Using the following utility functions ensures that the sanitized database is more likely to output the genotypes with larger frequencies in the original database. We set the utility score function to:

\[ \mu(D, BB) = p^2, \quad \mu(D, Bb) = 2pq, \quad \mu(D, bb) = q^2 \]

Thus, for the exponential mechanism with data set $D$, output range $R$, utility function $\mu : \mathbb{N}^{|X|} \to \mathbb{R}$, and privacy budget $\epsilon$, which guarantees $\epsilon$-differential privacy, we obtain:

\[ p[r] = \frac{\exp\left( \frac{\epsilon \, \mu(D, r)}{2 \Delta \mu} \right)}{\sum_{r' \in R} \exp\left( \frac{\epsilon \, \mu(D, r')}{2 \Delta \mu} \right)} \tag{3.16} \]

where $p[r]$ represents the probability of sampling genotype $r$.

Our work uses a robust differential privacy framework to protect genetic privacy. It investigates the application of a factor graph and belief propagation to the addition of local differential privacy noise, as illustrated in Figure 3.5. In this way, we obtain an appropriate balance between the utility and the privacy of genetic data. Pseudo-code for the exponential mechanism based differentially private release of associated SNPs is briefly described as follows:

Input: original genome data set $D$, the number $N$ of SNPs to be sanitized, the privacy budget $\epsilon$, and the sensitivity $\Delta$

Output: data set $D'$ with $N$ sanitized SNPs

1. For each SNP $i \in S' = \{1, 2, ..., N\}$:

2. retrieve its MAF $q_i$

3. $\mu(D, r)$ ← Get-Utility-Score($q_i$) for each $r \in R$ (Table 3.4)

4. For each $r \in R$:

5. $\Pr(s_i = r)$ ← Get-Sample-Probability($\mu(D, r)$) (Equation 3.16)

6. where Get-Sample-Probability($\mu(D, r)$) is proportional to $\exp\left( \frac{\epsilon \, \mu(D, r)}{2 N \Delta} \right)$

7. sample the genotype of SNP $i$ with probability $\Pr(s_i = r)$

8. return the $N$ associated SNPs with noisy genotypes
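The pseudo-code above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the thesis implementation: the function names, the default sensitivity of 1.0, and the SNP identifiers and MAF values in the usage example are all hypothetical. As in the pseudo-code, the exponent uses $\epsilon / (2 N \Delta)$, and the sampled genotype depends only on the MAF-derived utility scores.

```python
import math
import random

GENOTYPES = ("BB", "Bb", "bb")


def utility_scores(maf):
    """Hardy-Weinberg genotype frequencies (Table 3.4), used as utility scores."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be a frequency in [0, 1]")
    q = maf
    p = 1.0 - q
    return {"BB": p * p, "Bb": 2.0 * p * q, "bb": q * q}


def sampling_probabilities(maf, epsilon, sensitivity, n_snps):
    """Equation 3.16 with the per-SNP exponent epsilon * mu / (2 * N * Delta)."""
    if epsilon < 0 or sensitivity <= 0 or n_snps <= 0:
        raise ValueError("need epsilon >= 0, sensitivity > 0 and n_snps > 0")
    scores = utility_scores(maf)
    weights = {
        r: math.exp(epsilon * scores[r] / (2.0 * n_snps * sensitivity))
        for r in GENOTYPES
    }
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def sanitize_snps(mafs, epsilon, sensitivity=1.0, rng=None):
    """Sample one noisy genotype per SNP; `mafs` maps SNP id -> MAF."""
    rng = rng or random.Random()
    noisy = {}
    for snp_id, maf in mafs.items():
        probs = sampling_probabilities(maf, epsilon, sensitivity, len(mafs))
        weights = [probs[r] for r in GENOTYPES]
        noisy[snp_id] = rng.choices(GENOTYPES, weights=weights)[0]
    return noisy


if __name__ == "__main__":
    # Hypothetical SNP identifiers and MAF values, for illustration only.
    print(sanitize_snps({"snp_a": 0.10, "snp_b": 0.30}, epsilon=1.0,
                        rng=random.Random(7)))
```

With `epsilon = 0` the three genotypes are sampled uniformly; as `epsilon` grows, the sampling distribution increasingly favors the genotypes with higher Hardy-Weinberg frequency.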

3.5 Metrics for Kin Genomic Privacy and Utility

In this section, we describe the privacy and utility metrics for kin genomic data. The inference error is used to quantitatively measure the kin genomic privacy loss caused by the adversary's inference attack. Entropy is used to estimate the uncertainty of the inference attack. We also use these metrics to compare the privacy loss before and after data sanitization on the same data sets and with the same attack methods, which allows us to evaluate the effect of the sanitization methods. Moreover, we aim to find the minimal amount of data that must be sanitized in order to preserve data utility without changing the probability distribution.

3.5.1 Inference Error

The expected estimation error, or inference error, measures the distance between the predicted values and the ground truth when predicting $x_i$. It also quantitatively reflects the power of the attacker's inference attacks against kin genomic data [63]:

\[ E_i = \sum_{x_i} P(x_i \mid S_K, T_K, F, \Lambda) \, \lVert x_i - x'_i \rVert \tag{3.17} \]

where $x_i$ is the inferred value of the sensitive attribute (trait or SNP), and $x'_i$ is the actual attribute value from the real-world training data.

3.5.2 Entropy

According to Shannon's entropy principle, information entropy measures the uncertainty of inferring possible data values [64]. Therefore, the entropy of the distribution $p(x_i \mid S_K, T_K, F, \Lambda)$ over the possible values of a target trait or SNP can be used to measure the attacker's ambiguity about his inference results. It also provides a measure of the power of privacy preservation methods. Entropy is determined using the following formula:

\[ H_i = - \frac{\sum_{x_i} p(x_i \mid S_K, T_K, F, \Lambda) \log p(x_i \mid S_K, T_K, F, \Lambda)}{\log(3)} \tag{3.18} \]

where $x_i$ is either the unknown target trait or SNP. The higher the entropy, the greater the ambiguity of the inferred values and the stronger the privacy preservation.
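Both metrics can be computed directly from the attacker's posterior distribution over the encoded genotype values. The sketch below assumes that the posterior is given as a sequence indexed by the encoded value (0, 1, 2) and that the distance in Equation 3.17 is the absolute difference between encoded values; the function names are illustrative.

```python
import math


def inference_error(posterior, true_value):
    """Equation 3.17, using |x - x'| as the distance between encoded values.

    posterior  -- probabilities indexed by the encoded value (0, 1, 2)
    true_value -- the actual encoded value x'
    """
    return sum(p * abs(x - true_value) for x, p in enumerate(posterior))


def normalized_entropy(posterior):
    """Equation 3.18: Shannon entropy divided by log(3).

    Terms with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log(p) for p in posterior if p > 0.0) / math.log(3)


if __name__ == "__main__":
    posterior = [0.25, 0.5, 0.25]  # hypothetical attacker belief for one SNP
    print(inference_error(posterior, true_value=1))  # 0.5
    print(normalized_entropy(posterior))
```

A posterior that puts all of its mass on the true value gives an inference error of 0 and an entropy of 0, while a uniform posterior over the three genotypes gives a normalized entropy of 1.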

3.5.3 Utility

Data utility is determined by the number of SNPs that are sanitized while the data sanitization method still satisfies $\epsilon$-differential privacy. The fewer SNPs in a data set are sanitized, the more utility is preserved.

Chapter 4 Validation

In this chapter, we evaluate and validate our proposed methods for inference attack and data sanitization (previously described in Chapter 3). The experimental design, including the programming environment setup and the experimental data, and the results of the experiments are presented.

4.1 Experimental Design

4.1.1 Experiment Environment

The implementation is written in Python and run on macOS. Python libraries such as Pandas, NumPy, NetworkX [65], and pgmpy [66] are used for implementation and experiments.

4.1.2 Experimental Data

To evaluate our proposed framework, two familial genomic data sets are used:

• The Manuel Corpas’ family pedigree genomic data [67].

• The CEPH/UTAH Pedigree 1463 partial genomic data (over 8k SNPs) obtained from Deznabi et al. [68], which was originally sequenced in the 1000 Genomes Project [69].

Dr. Corpas, a biologist, shares his family quartet genomic data. The Corpas family tree is shown in Figure 4.1. The original data file is in the Variant Call Format (VCF). We preprocess and clean the data by removing unused information, such as chromosome number and position, and by dropping tuples that are missing the SNP ID or the genotype result.

The original CEPH/UTAH Pedigree 1463 has 17 family members [69]. The average number of children in a typical American family has been around 1.9 for the last two decades [70]. Considering the typical family household size nowadays, data availability, and simplicity, we use the genomic information of only 11 family members of the CEPH/UTAH Pedigree 1463, as displayed in Figure 4.2.

In addition to kin genomic data in SNP format, other data are also collected and used in our experiments. The minor allele frequencies (MAFs) of the SNPs from the Manuel Corpas family pedigree are extracted from the NCBI SNP database and the 1000 Genomes Project [69]. Because there is no ground truth for traits, we simulate two traits for the Corpas family members and identify their associated clinically significant SNPs: rs6025 and rs1801581. The SNP rs6025 is significantly associated with venous thromboembolism, with a risk ratio of 3.57 and a risk allele frequency of 0.063 [71]. The SNP rs1801581 is associated with age-related macular degeneration (ARMD) [72]. Its risk allele frequency is 0.015375 [69].

Figure 4.1: Manuel Corpas family tree of the four family members. Females are represented by ellipses and males by squares.
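As an illustration of the cleaning step described above, the following minimal sketch keeps only the SNP ID and the genotype from VCF-style records and drops records in which either is missing. It uses only the Python standard library rather than the Pandas-based pipeline of this work, and the records, the sample column name SON, and the function name clean_vcf are hypothetical placeholders, not the actual Corpas data.

```python
import csv
import io

# Hypothetical VCF-style records for one sample column named "SON".
_HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SON"]
_RECORDS = [
    ["1", "1000", "rs0000001", "A", "G", "50", "PASS", ".", "GT", "0/1"],
    ["1", "2000", ".", "C", "T", "50", "PASS", ".", "GT", "1/1"],          # missing SNP ID
    ["1", "3000", "rs0000003", "G", "A", "50", "PASS", ".", "GT", "./."],  # missing genotype
    ["1", "4000", "rs0000004", "T", "C", "50", "PASS", ".", "GT", "0/0"],
]
VCF_TEXT = "##fileformat=VCFv4.1\n" + "\n".join("\t".join(r) for r in [_HEADER] + _RECORDS)


def clean_vcf(vcf_text, sample):
    """Return (SNP ID, genotype) pairs, dropping unused columns and incomplete records."""
    # Skip the "##" meta-information lines; keep the "#CHROM ..." column header.
    lines = [ln for ln in vcf_text.splitlines() if ln and not ln.startswith("##")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    cleaned = []
    for record in reader:
        snp_id = record["ID"]
        genotype = record[sample].split(":")[0]  # the GT field comes first
        if snp_id == "." or "." in genotype:     # "." marks a missing value in VCF
            continue
        cleaned.append((snp_id, genotype))
    return cleaned


if __name__ == "__main__":
    for snp_id, genotype in clean_vcf(VCF_TEXT, "SON"):
        print(snp_id, genotype)
```

Only the two complete records survive the filter; chromosome number and position are discarded because they are not used by the later inference steps.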

4.2 Results

4.2.1 Inference Attack

We randomly select 40 SNPs located on Chromosome 1 of the Manuel Corpas family genome data and hide all 40 of these SNPs of the son as the targets. We first assume that the attacker knows the whole genome of his mother, and then that the SNP information of both parents is available for prediction. The inference error and entropy are calculated using Equation 3.17 and Equation 3.18, respectively. The results are shown in Table 4.1, Figure 4.3, and Figure 4.4. In Table 4.1, the columns "Inference Error" and "Entropy" give the results of inferring the son’s SNPs given both parents’ genomes by performing belief propagation on the factor graph, whereas "Inference Error*" and "Entropy*" evaluate the prediction of the son’s target SNPs given only his mother’s genome information. As more of the relatives’ SNPs are released, both the inference error and the entropy decrease (Figure 4.3 and Figure 4.4). This implies that releasing more relatives’ SNPs can result in greater privacy loss for family members, even when the target’s genome is completely hidden.

Similarly, we randomly select 40 SNPs as target SNPs from the CEPH/UTAH Pedigree 1463 (Figure 4.2). We assume that an attacker’s goal is to learn these 40 SNPs of r1, given the genomes of some of r1’s relatives. For conciseness, we show only the average inference error and entropy. Our results indicate that as more relatives’ genome information is released, the inference error decreases, thereby increasing r1’s privacy loss, as shown in Figure 4.5. Comparing the release of the target individual’s parents’ genomes (r8+r9) with the release of his children’s genomes (r3+r4), the parents’ information produces a smaller average inference error. This implies that publishing the parents’ genomes poses a greater risk of revealing the target’s genetic information.

We use entropy to estimate the adversary’s uncertainty about the inference attack. As expected, the entropy generally decreases as more relatives’ genomes are released (Figure 4.6). This trend is consistent with the effect on the inference error in Figure 4.5. Interestingly, we observe higher entropy given the genetic information of the target’s parents and one child than given that of the target’s parents alone. This suggests that, under some conditions, additional information, especially information that may contain errors, can increase an attacker’s uncertainty.

Table 4.1: Inference attack results

SNP ID        Inference Error  Entropy      Inference Error*  Entropy*
rs79585140    0.75             0.94639463   0.75              0.94639463
rs75454623    0.5              0.94639463   0.5               0.94639463
rs144718396   0.5              0.94639463   0.5               0.94639463
rs75062661    0                0            0.5               0.630929754
rs146246821   0.5              0.94639463   0.5               0.94639463
rs11240779    0                0            0.5               0.630929754
rs6594027     0                0            0.5               0.630929754
rs11240780    0                0            0.5               0.630929754
rs147199422   1                0            0.5               0.630929754
rs7541694     0                0            0.5               0.630929754
rs7545373     0                0            0.5               0.630929754
rs9988021     0                0            0.5               0.630929754
rs60722469    0                0            0.5               0.630929754
rs142929357   0.5              0.630929754  0.5               0.630929754
rs7523549     0.5              0.630929754  0.5               0.94639463
rs149880798   0.5              0.630929754  0.5               0.94639463
rs6605067     0                0            0.5               0.630929754
rs2839        0                0            0.5               0.630929754
rs3748592     0                0            0.5               0.630929754
rs3748593     0.5              0.630929754  0.5               0.94639463
rs2272757     0.5              0.630929754  0.5               0.630929754
rs7522415     1                0            0.5               0.630929754
rs4970455     1                0            0.5               0.630929754
rs3748595     0                0            0.5               0.630929754
rs3828047     0                0            0.5               0.630929754
rs3748596     0                0            0.5               0.630929754
rs56262069    0                0            0.5               0.630929754
rs13302945    0                0            0.5               0.630929754
rs41285802    0.5              0.630929754  0.5               0.94639463
rs13303227    0                0            0.5               0.630929754
rs13303010    1                0            1                 0.630929754
rs4970441     0                0            0.5               0.630929754
rs7549631     0.5              0.630929754  0.5               0.94639463
rs6605071     0                0            0.5               0.630929754
rs4970435     0                0            0.5               0.630929754
rs4970434     0                0            0.5               0.630929754
rs28705211    1                0            0.5               0.630929754
rs116147894   1                0            0.5               0.630929754
rs28687780    0.75             0.94639463   0.75              0.94639463
rs3892467     0.5              0.630929754  0.5               0.630929754
Average       0.325            0.24448528   0.525             0.709795973

Figure 4.2: CEPH/UTAH Pedigree 1463 of 11 individuals. Females are represented by ellipses and males by squares. The family tree represents family relationships.

Figure 4.3: Evaluation of the son’s privacy loss by inference error, given the genomes of the son’s different relatives

Figure 4.4: Evaluation of the adversary’s uncertainty for prediction of the son, given the genomes of the son’s different relatives
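To make the two evaluation metrics concrete, the following minimal sketch computes a child’s genotype distribution for a single SNP directly from Mendel’s law and then scores it. It is not the belief propagation attack on the factor graph used in this work. It also assumes particular forms for the metrics: the inference error is taken as one minus the probability assigned to the true genotype, and the entropy as Shannon entropy normalized by log 3. These forms stand in for Equation 3.17 and Equation 3.18 and were chosen because they reproduce values that appear in Table 4.1 (0.94639463 and 0.630929754); all function names are hypothetical.

```python
import math

GENOTYPES = ("BB", "Bb", "bb")

# Probability that a parent with the given genotype transmits allele B.
P_TRANSMIT_B = {"BB": 1.0, "Bb": 0.5, "bb": 0.0}


def child_distribution(mother, father):
    """P(child genotype | both parents' genotypes) under Mendelian inheritance."""
    pm, pf = P_TRANSMIT_B[mother], P_TRANSMIT_B[father]
    return {
        "BB": pm * pf,
        "Bb": pm * (1 - pf) + (1 - pm) * pf,
        "bb": (1 - pm) * (1 - pf),
    }


def inference_error(dist, true_genotype):
    """Assumed metric: probability mass NOT placed on the true genotype."""
    return 1.0 - dist[true_genotype]


def normalized_entropy(dist):
    """Assumed metric: Shannon entropy divided by log(3), so it lies in [0, 1]."""
    h = sum(-p * math.log(p) for p in dist.values() if p > 0)
    return h / math.log(len(GENOTYPES))


if __name__ == "__main__":
    for mother, father, truth in [("Bb", "Bb", "BB"), ("BB", "Bb", "Bb"), ("BB", "BB", "Bb")]:
        dist = child_distribution(mother, father)
        print(mother, father, truth,
              round(inference_error(dist, truth), 3),
              round(normalized_entropy(dist), 9))
```

Under these assumptions, two BB parents give a distribution concentrated entirely on BB, so a son whose recorded genotype is Bb yields an inference error of 1 with an entropy of 0, which is consistent with the rs147199422 row of Table 4.1.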

4.2.2 Data Sanitization

We use the Manuel Corpas family genetic dataset to evaluate the effects of adding differentially private (DP) noise. Specifically, the goal is to evaluate how DP noise based on the exponential mechanism (EP) contributes to genetic privacy preservation. In this experiment, we set the son as our target and try to infer his masked SNPs. First, we assume that the SNPs of his mother are released, and we use the same methods as in the previous experiments for the inference attack and evaluation. Then we apply EP, which samples 100 outputs for the corresponding SNP of the mother, and use these outputs as the evidence for the same inference attack and evaluation. Finally, we compare the average inference error and entropy under these two conditions.

The experimental results show that adding DP noise through the exponential mechanism contributes to preserving kin genomic data privacy. The inference error increases significantly when EP noise is injected (Figure 4.7). Moreover, with EP noise, the inference has higher entropy (Figure 4.8), implying higher uncertainty. These results are consistent with our expectation that DP is a potential tool for preventing genetic privacy breaches.
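The sampling step described above can be sketched as follows. This is only an illustration of the general exponential mechanism, not the sanitization method defined in Chapter 3. The genotype encoding (0, 1, or 2 copies of the minor allele), the utility function (negative distance to the true genotype), the sensitivity of 2, and the privacy budget are all assumptions, and the function names are hypothetical.

```python
import math
import random

GENOTYPES = (0, 1, 2)  # assumed encoding: copies of the minor allele (BB, Bb, bb)


def exponential_mechanism_probs(true_genotype, epsilon, sensitivity=2.0):
    """Output probabilities proportional to exp(epsilon * utility / (2 * sensitivity))."""
    # Assumed utility: outputs closer to the true genotype score higher.
    scores = [-abs(g - true_genotype) for g in GENOTYPES]
    weights = [math.exp(epsilon * s / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_sanitized_genotypes(true_genotype, epsilon, n_samples=100, seed=None):
    """Draw n_samples noisy genotypes for one SNP, to be used as attack evidence."""
    rng = random.Random(seed)
    probs = exponential_mechanism_probs(true_genotype, epsilon)
    return rng.choices(GENOTYPES, weights=probs, k=n_samples)


if __name__ == "__main__":
    # Hypothetical example: the mother's true genotype is BB (encoded as 0).
    print(exponential_mechanism_probs(0, epsilon=1.0))
    print(sample_sanitized_genotypes(0, epsilon=1.0, seed=0)[:10])
```

A smaller epsilon flattens the output distribution toward uniform, which gives the attacker less reliable evidence at the cost of data utility.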

Figure 4.5: Evaluation of r1 privacy loss by inference error, given the genome of r1’s different relatives

Figure 4.6: Evaluation of adversary’s uncertainty for prediction of r1, given the genomes of r1’s different relatives

Figure 4.7: Evaluation of DP noise in genetic privacy preservation by inference error

Figure 4.8: Evaluation of DP noise in genetic privacy preservation by entropy

Chapter 5 Conclusion

5.1 Summary of This Work

With the enormous benefits of genomic data sharing and the risk of privacy breaches through this data, privacy-preserving genomic data publishing has become an active and vital field of investigation. It is impossible to realize “absolute disclosure prevention” if data utility is required [73]. In addition, the nature of genome data, such as its Big Data scale and the high similarity among family members, makes it challenging to achieve an effective balance between data openness and privacy. This work has presented an approach to constructing novel genome models that consider the kin-genomic relationships in families. To promote privacy preservation, we investigate two problems: how to perform inference attacks, and how to sanitize kin genomic data while achieving a balance between data privacy and utility. This work shows that performing belief propagation on factor graphs constitutes an effective inference attack on kin genomic data. The findings confirm that kin genome data are highly correlated: the more of one’s relatives’ genetic information is leaked, the more accurately one’s own DNA can be predicted. Furthermore, this work indicates that differential privacy with the exponential mechanism can be used to obtain an appropriate balance between utility and privacy.

5.2 Challenge

One primary challenge is that the raw data contain some low-accuracy sequencing results, which can negatively affect inference and evaluation. For example, at rs147199422, the son has the genotype Bb while both parents have the genotype BB, which is inconsistent with Mendelian inheritance. Such errors similarly cause issues for inference when sanitization is based on perturbation, although they help increase an individual’s privacy.
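Records of this kind can be flagged before inference with a simple consistency check. The sketch below is a hypothetical helper, not part of the implementation evaluated in Chapter 4. It assumes biallelic SNPs written as two-character genotype strings (BB, Bb, bb) and ignores de novo mutations.

```python
from itertools import product


def normalize(genotype):
    """Order the two alleles so that 'bB' and 'Bb' compare equal."""
    return "".join(sorted(genotype))


def possible_child_genotypes(mother, father):
    """All genotypes obtainable by taking one allele from each parent."""
    return {normalize(m + f) for m, f in product(mother, father)}


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be inherited from the two parents."""
    return normalize(child) in possible_child_genotypes(mother, father)


if __name__ == "__main__":
    # The rs147199422 case from the text: son Bb, both parents BB.
    print(is_mendelian_consistent("Bb", "BB", "BB"))
    print(is_mendelian_consistent("Bb", "BB", "Bb"))
```

Whether flagged SNPs should be dropped or kept is a design decision. Dropping them gives cleaner evaluation results, while keeping them reflects the noise that a real attacker would face.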

5.3 Future Work

For developing a better model and approaches in the future, we may extend our work in the following aspects:

• Integrating the linkage disequilibrium (LD) of different SNPs into the model. Taking LD into account would probably allow an adversary to infer more hidden SNPs, even if none of them are released by the target’s family members.

• Experimenting with a different mechanism for differential privacy in genomic privacy protection.

• Use of parallel computing.

• Exploring application of neural network for genome data modeling and inference.

References

[1] Kris A Wetterstrand. Dna sequencing costs: Data from the nhgri genome sequencing program (gsp), 2019. [Online; accessed July 1, 2019].

[2] Embl-ebi annual scientific report 2017., 2017. [Online; accessed July 1, 2019].

[3] Vivien Marx. Biology: The big challenges of big data, 2013.

[4] George J Annas and Sherman Elias. 23andme and the fda. New England Journal of Medicine, 370(11):985–988, 2014.

[5] Bastian Greshake, Philipp E Bayer, Helge Rausch, and Julia Reda. Opensnp–a crowdsourced web resource for personal genomics. PLoS One, 9(3):e89204, 2014.

[6] Paul Wicks, Michael Massagli, Jeana Frost, Catherine Brownstein, Sally Okun, Timothy Vaughan, Richard Bradley, and James Heywood. Sharing health data for better outcomes on patientslikeme. Journal of medical Internet research, 12(2):e19, 2010.

[7] Steven Munevar. Unlocking big data for better health. Nature biotechnology, 35(7):684, 2017.

[8] Jane Kaye, Catherine Heeney, Naomi Hawkins, Jantina De Vries, and Paula Boddington. Data sharing in genomics—re-shaping scientific practice. Nature Reviews Genetics, 10(5):331, 2009.

[9] Annalisa Buniello, Jacqueline A L MacArthur, Maria Cerezo, Laura W Harris, James Hayhurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sollis, et al. The nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1):D1005–D1012, 2018.

[10] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: astronomical or genomical? PLoS biology, 13(7):e1002195, 2015.

[11] Melissa Gymrek, Amy L McGuire, David Golan, Eran Halperin, and Yaniv Erlich. Identifying personal genomes by surname inference. Science, 339(6117):321–324, 2013.

[12] Eric E Schadt, Sangsoon Woo, and Ke Hao. Bayesian method to predict individual snp genotypes from gene expression data. Nature genetics, 44(5):603, 2012.

[13] Yaniv Erlich and Arvind Narayanan. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics, 15(6):409, 2014.

[14] Frank Stajano, Lucia Bianchi, Pietro Liò, and Douwe Korff. Forensic genomics: kin privacy, driftnets and other open questions. In Proceedings of the 7th ACM workshop on Privacy in the electronic society, pages 15–22. ACM, 2008.

[15] OpenStax Anatomy and Physiology dna nucleotides, 2016. [Online; accessed March 25, 2019].

[16] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A Holt, et al. The sequence of the human genome. science, 291(5507):1304–1351, 2001.

[17] Jean Marie Cornuet and Gordon Luikart. Description and power analysis of two tests for detecting recent population bottlenecks from allele frequency data. Genetics, 144(4):2001–2014, 1996.

[18] Nathan A Baird, Paul D Etter, Tressa S Atwood, Mark C Currey, Anthony L Shiver, Zachary A Lewis, Eric U Selker, William A Cresko, and Eric A Johnson. Rapid snp discovery and genetic mapping using sequenced rad markers. PloS one, 3(10):e3376, 2008.

[19] David E Reich, Stacey B Gabriel, and David Altshuler. Quality and completeness of snp databases. Nature genetics, 33(4):457, 2003.

[20] Meredith Yeager, Nick Orr, Richard B Hayes, Kevin B Jacobs, Peter Kraft, Sholom Wacholder, Mark J Minichiello, Paul Fearnhead, Kai Yu, Nilanjan Chatterjee, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature genetics, 39(5):645, 2007.

[21] Dale R Nyholt, Chang-En Yu, and Peter M Visscher. On jim watson’s apoe status: genetic information is hard to hide. European Journal of Human Genetics, 17(2):147, 2009.

[22] WG Hill and Alan Robertson. Linkage disequilibrium in finite populations. Theoretical and Applied Genetics, 38(6):226–231, 1968.

[23] William Ernest Castle. Mendel’s law of heredity. Science, 18(456):396–406, 1903.

[24] Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6(2):95, 2005.

[25] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[26] Lennart Svensson. Ml tutorial: Factor graphs, belief propagation and variational techniques, 2016. [Online; accessed March 30, 2019].

[27] Susan Walsh, Fan Liu, Kaye N Ballantyne, Mannis van Oven, Oscar Lao, and Manfred Kayser. Irisplex: a sensitive dna tool for accurate prediction of blue and brown eye colour in the absence of ancestry information. Forensic Science International: Genetics, 5(3):170–180, 2011.

[28] Peter Kwok, Michael Davern, Elizabeth Hair, and D Lafky. Harder than you think: a case study of re-identification risk of hipaa-compliant records. Chicago: NORC at The University of Chicago. Abstract, 302255, 2011.

[29] Bradley Malin. Re-identification of familial database records. In AMIA annual symposium proceedings, volume 2006, page 524. American Medical Informatics Association, 2006.

[30] Latanya Sweeney, Akua Abu, and Julia Winn. Identifying participants in the personal genome project by name (a re-identification experiment). arXiv preprint arXiv:1304.7605, 2013.

[31] William W Lowrance and Francis S Collins. Identifiability in genomic research. Science, 317(5838):600–602, 2007.

[32] Manfred Kayser and Peter De Knijff. Improving human forensics through advances in genetics, genomics and molecular biology. Nature Reviews Genetics, 12(3):179, 2011.

[33] Mathias Humbert, Erman Ayday, Jean-Pierre Hubaux, and Amalio Telenti. Addressing the concerns of the lacks family: quantification of kin genomic privacy. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1141–1152. ACM, 2013.

[34] Raphaël Mourad, Christine Sinoquet, and Philippe Leray. Probabilistic graphical models for genetic association studies. Briefings in bioinformatics, 13(1):20–33, 2011.

[35] Lu Zhang, Qiuping Pan, Xintao Wu, and Xinghua Shi. Building bayesian networks from gwas statistics based on independence of causal influence. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 529–532. IEEE, 2016.

[36] Yue Wang, Xintao Wu, and Xinghua Shi. Using aggregate human genome data for individual identification. In 2013 IEEE International Conference on Bioinformatics and Biomedicine, pages 410–415. IEEE, 2013.

[37] Jonathan Marchini and Bryan Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7):499, 2010.

[38] Bryan Howie, Christian Fuchsberger, Matthew Stephens, Jonathan Marchini, and Gonçalo R Abecasis. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature genetics, 44(8):955, 2012.

[39] Augustine Kong, Gisli Masson, Michael L Frigge, Arnaldur Gylfason, Pasha Zusmanovich, Gudmar Thorleifsson, Pall I Olason, Andres Ingason, Stacy Steinberg, Thorunn Rafnar, et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nature genetics, 40(9):1068, 2008.

[40] Erman Ayday and Mathias Humbert. Inference attacks against kin genomic privacy. IEEE Security & Privacy, 15(5):29–37, 2017.

[41] Erman Ayday, Jean Louis Raisaro, Urs Hengartner, P Jack, Adam Molyneaux, and Jean-Pierre Hubaux. Towards privacy compliance in the management of raw genomic data. In 2013 USENIX Security Workshop on Health Information Technologies (HealthTech 2013), 2013.

[42] Yangyi Chen, Bo Peng, XiaoFeng Wang, and Haixu Tang. Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds. In NDSS, 2012.

[43] Ricardo A Baeza-Yates and Chris H Perleberg. Fast and practical approximate string matching. In Annual Symposium on Combinatorial Pattern Matching, pages 185–192. Springer, 1992.

[44] Aaron Johnson and Vitaly Shmatikov. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1079–1087. ACM, 2013.

[45] Roberto J Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In 21st International conference on data engineering (ICDE’05), pages 217–228. IEEE, 2005.

[46] Shibiao Wan, Man-Wai Mak, and Sun-Yuan Kung. Protecting genomic privacy by a sequence-similarity based obfuscation method. arXiv preprint arXiv:1708.02629, 2017.

[47] Bradley A Malin. Protecting genomic sequence anonymity with generalization lattices. Methods of information in medicine, 44(05):687–692, 2005.

[48] Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In 22nd International Conference on Data Engineering (ICDE’06), pages 24–24. IEEE, 2006.

[49] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 202–210. ACM, 2003.

[50] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.

[51] Cynthia Dwork. Differential privacy. Encyclopedia of Cryptography and Security, pages 338–340, 2011.

[52] Caroline Uhler, Aleksandra Slavković, and Stephen E Fienberg. Privacy-preserving data sharing for genome-wide association studies. The Journal of privacy and confidentiality, 5(1):137, 2013.

[53] Fei Yu, Stephen E Fienberg, Aleksandra B Slavković, and Caroline Uhler. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of biomedical informatics, 50:133–141, 2014.

[54] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[55] Shuang Wang, Noman Mohammed, and Rui Chen. Differentially private genome data dissemination through top-down specialization. BMC medical informatics and decision making, 14(1):S2, 2014.

[56] Erman Ayday, Emiliano De Cristofaro, Jean-Pierre Hubaux, and Gene Tsudik. Whole genome sequencing: Revolutionary medicine or privacy nightmare? Computer, 48(2):58–66, 2015.

[57] Amalio Telenti, Erman Ayday, and Jean Pierre Hubaux. On genomics, kin, and privacy. F1000Research, 3, 2014.

[58] Zaobo He, Jiguo Yu, Ji Li, Qilong Han, Guangchun Luo, and Yingshu Li. Inference attacks and controls on genotypes and phenotypes for individual genomic data. IEEE/ACM transactions on computational biology and bioinformatics, 2018.

[59] Cynthia Dwork. The promise of differential privacy a tutorial on algorithmic techniques. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, D (Oct. 2011), pages 1–2.

[60] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007.

[61] Zaobo He, Yingshu Li, and Jinbao Wang. Differential privacy preserving genomic data releasing via factor graph. In International Symposium on Bioinformatics Research and Applications, pages 350–355. Springer, 2017.

[62] H Duzkale, J Shen, H McLaughlin, A Alfares, MA Kelly, TJ Pugh, BH Funke, HL Rehm, and MS Lebo. A systematic approach to assessing the clinical significance of genetic variants. Clinical genetics, 84(5):453–463, 2013.

[63] Z. He, J. Yu, J. Li, Q. Han, G. Luo, and Y. Li. Inference attacks and controls on genotypes and phenotypes for individual genomic data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 1–1, 2018.

[64] Claude Elwood Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.

[65] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008.

[66] Ankur Ankan and Abinash Panda. pgmpy: Probabilistic graphical models using python. In Proceedings of the 14th Python in Science Conference (SCIPY 2015). Citeseer, 2015.

[67] Manuel Corpas. Crowdsourcing the corpasome. Source code for biology and medicine, 8(1):13, 2013.

[68] Iman Deznabi, Mohammad Mobayen, Nazanin Jafari, Oznur Tastan, and Erman Ayday. An inference attack on genomic data using kinship, complex correlations, and phenotype information. IEEE/ACM transactions on computational biology and bioinformatics, 15(4):1333–1343, 2017.

[69] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.

[70] Erin Duffin. Average number of own children per u.s. family with own children 1960-2019, 2020. [Online; accessed Jan 24, 2020].

[71] John A Heit, Sebastian M Armasu, Yan W Asmann, Julie M Cunningham, Martha E Matsumoto, Tanya M Petterson, and Mariza De Andrade. A genome-wide association study of venous thromboembolism identifies risk variants in chromosomes 1q24.2 and 9q. Journal of thrombosis and haemostasis, 10(8):1521–1531, 2012.

[72] Michael Cariaso and Greg Lennon. Snpedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic acids research, 40(D1):D1308–D1312, 2012.

[73] Cynthia Dwork and Moni Naor. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Journal of Privacy and Confidentiality, 2(1), 2010.
