
arXiv:1904.00800v2 [cs.IT] 2 Apr 2019

Private Shotgun DNA Sequencing: A Structured Approach

Ali Gholami, Mohammad Ali Maddah-Ali, and Seyed Abolfazl Motahari

Abstract: DNA sequencing has faced a huge demand since it was first introduced as a service to the public. This service is often offloaded to sequencing companies, who will have access to full knowledge of individuals' sequences, a major violation of privacy. To address this challenge, we propose a solution based on separating the process of reading the fragments of sequences, which is done at a sequencing machine, from assembling the reads, which is done at a trusted local data collector. To confuse the sequencer, in a pooled sequencing scenario in which multiple target individuals are going to be sequenced simultaneously, we add fragments of one non-target individual with a DNA sequence known at the data collector. Then, the coverage depth of each individual, defined as the number of DNA fragments per site, is selected proportional to the powers of two. This layered structured solution allows us to ensure privacy using only one sequencing machine, in contrast to our previous solution, where we relied on the existence of multiple non-colluding sequencing machines.

Index Terms: DNA sequencing, shotgun sequencing, privacy.

Ali Gholami is with the Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran. Mohammad Ali Maddah-Ali is with Nokia Bell Labs, Holmdel, New Jersey, USA. Seyed Abolfazl Motahari is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

I. INTRODUCTION

Functionalities within the human body are coded in the DNA. The way cells evolve and form different tissues and limbs is highly correlated to the information stored in the genome sequence. The human genome is a sequence of nucleotides chosen from a four-member set $\{A, C, G, T\}$. The sequences in human genomes are very similar (more than 98 percent are alike). What is mostly responsible for variations among individuals is called Single Nucleotide Polymorphisms (SNPs). In fact, an individual's genome can be uniquely characterized by its SNPs, which is also called genotyping.

Having access to the genome sequence can benefit individuals in health care decision-making procedures, for both diagnostic and therapeutic purposes [1], [2], [3]. As a result, usage of genetic testing services has risen massively in the past decade [4], [5], as well as the procedures involving genomic data. As this data is becoming a leading part of health care, the concerns about its privacy and confidentiality have grown similarly [6], [7], [8]. The disclosure of this data can be used maliciously, for example by health care providers and companies to increase the insurance rates for particular diseases and drugs. Moreover, the disclosure of this information puts the relatives in danger as well, due to the inherited similarities between family members [9], [10]. Thus, accessing genomic data is on one hand useful in curing diseases, and on the other hand its disclosure is a violation of the privacy of individuals [11]–[14].

There are a lot of papers addressing the issue of privacy in data exploration for genomic data. Some have used the concept of differential privacy, some have used k-anonymity for providing privacy, and others have provided solutions by cryptographic methods [15]–[23]. The objective in all those papers was to make sure no one's data is revealed in a published data set due to the process of sharing data for research purposes. In this paper, we look into the issue of privacy in a different way. Privacy is violated at the beginning of the sequencing process, due to the access of the sequencing company to the sequence. Therefore, even before we disclose our data, the company knows our sequence.

The most popular method for sequencing the whole genome is shotgun DNA sequencing [24], [25], [26]. In this method, the genome is broken into multiple fragments with various lengths. After that, a sequencing machine reads these fragments and assembles the reads to build the whole sequence. The assembling algorithms available let the sequencing procedure be both cost and time effective. It takes just a couple of days, with a cost of less than 1000 dollars, to sequence the genome, thanks to the existing sequencing machines. Also, to further reduce the costs and time, pooled sequencing can be used [27], [28], [29]. In this methodology, rather than sequencing one individual, the genomes of a set of individuals are pooled together and sent to the sequencer. This will reduce the cost in comparison to the case in which these individuals are sequenced separately. Also, as will be seen later on, the usage of pooled sequencing will benefit us in providing the privacy constraint.

Taking a deep look at the sequencing procedure, we realized that the sequencing process itself is a source of leakage of information. In this paper, we introduce a scheme for sequencing in which this leakage is prevented as much as possible, and we guarantee this kind of privacy mathematically. In fact, we are going to sequence the genome of a set of individuals using a sequencing machine, while limiting the knowledge received by the sequencer as desired. We first mention that the sequencing process consists of two phases. First is the reading phase, in which the sequencer reads the received fragments, i.e. determines the nucleotides in each fragment. Second is the processing phase, where a machine called the data collector, using the received reads, assembles the sequence of each individual. We aim at separating the two phases to provide privacy. In fact, we will introduce a methodology in which the sequencer is unable to do the processing phase while the data collector has the ability. In other words, the reading phase, which needs high-tech machines, is outsourced, and the processing phase, which is computational, is done on a trusted local machine. To separate the two phases, we should make sure the data collector has more information in comparison to the sequencer. One of the ideas used in that direction is the usage of a set of individuals whose genome sequence is known a priori to the data collector and unknown to the sequencer. The other idea is to use finite field addition. Briefly, if we have two binary random variables and one of them has a uniform distribution, their summation in the binary field reveals no new information about the non-uniform random variable; i.e. having the output of this summation does not change the distribution of the random variable in comparison to the prior distribution. With these two ideas, we are going to limit the information leakage at the sequencer as desired, while letting the data collector reconstruct the sequences.

This problem is conceptually connected to the Shamir secret sharing scheme [30]. In this scheme, a secret is partitioned into multiple parts, and each part is stored in a database. This partition is done in such a way that the secret can be reconstructed from a subset of the databases. In fact, there is a threshold for the number of databases such that any subset with at least that many databases can reconstruct the secret, and any subset with fewer databases than the threshold receives no information about the secret [31]. Based on this solution, there are many works providing solutions [32], [33].

The rest of the paper is organized as follows. The problem setting is provided in Section II. In Section III, an achievable scheme is introduced with the corresponding results. In Section IV, a generalized version of the scheme is introduced with the resulting theorems, and Section V concludes the paper and introduces some future steps.

II. PROBLEM SETTING

We propose an architecture in which there is a trusted data collector and a sequencing machine (i.e. sequencer). Also, there is a set of individuals that want their genome to be sequenced privately, without leaking the sequence data to the sequencer. There are $M \in \mathbb{N}$ individuals in this set and they are labeled from $0$ to $M-1$. The data collector has the duty to gather the genomes of the individuals in the set, pool their fragments together (the genome is sheared into fragments of various sizes), and send this pool to the sequencer. Then, the sequencer reads these fragments (reading phase) and reports the resulting reads to the data collector. At last, the data collector, using the set of reads, assembles the sequences for all individuals (processing phase) and reports the results to them.

To provide privacy, unlike conventional methods, we aim at separating the reading phase from the processing phase. In fact, the sequencer has the duty to do the reading phase and the data collector is used for the processing phase. Our objective for privacy is to guarantee that the processing phase cannot be done at the sequencer.

To separate the two phases, we should create an information gap between the sequencer and the data collector. To do this, we use another set of individuals whose sequences are known beforehand to the data collector but unknown to the sequencer. The genomes of this set of individuals are also collected by the data collector and their fragments are added to the pool. This set is of size $K \in \mathbb{N}$; its individuals are labeled from $0$ to $K-1$ and are called known individuals. The individuals of the previous set, whose genomes we aim to sequence, are called unknown individuals.

We referred to SNPs earlier as the main source of difference between human genomes. Although there are four types of nucleotides, only two of them can occur in every SNP position for all individuals, and this binary set in every position is known a priori for the population. Also, for each SNP position, the allele occurring with more frequency in the population is called the major allele and the one occurring with less frequency is called the minor allele. Considering this, the sequence of every individual can be characterized by a vector in $\{0,1\}^N$, where $N \in \mathbb{N}$ is the total number of SNPs and $0$ and $1$ represent the minor and major alleles, respectively. Moreover, we define the matrix $X$, which contains the random variable $X_{m,n} \in \{0,1\}$ in its row $m$ and column $n$, indicating the allele for unknown individual $m$ in SNP position $n$. Similarly, the matrix $Y$ is defined for the known individuals. Keep in mind that the entries in $X$ are unknown both at the sequencer and the data collector, but the entries in $Y$ are unknown to the sequencer and known to the data collector, leading to an information gap between these two.

Let $\mathcal{F}_{m,n}$ and $\tilde{\mathcal{F}}_{k,n}$ denote the set of fragments containing SNP position $n \in [N]$ for the unknown individual $m$ and known individual $k$, respectively. The data collector sends the set of fragments $\bigcup_{m=1}^{M}\bigcup_{n=1}^{N}\mathcal{F}_{m,n} + \bigcup_{k=1}^{K}\bigcup_{n=1}^{N}\tilde{\mathcal{F}}_{k,n}$ to the sequencer (see Fig. 1). Let us define the random variables $\alpha_{m,n} \triangleq |\mathcal{F}_{m,n}|$ and $\tilde{\alpha}_{k,n} \triangleq |\tilde{\mathcal{F}}_{k,n}|$ as the depth for SNP position $n$ for the unknown individual $m$ and known individual $k$, respectively. Note that in the sequencing process, a number of genomes from each individual are provided to the data collector, so for most regions in the genome of one individual there are multiple fragments containing the region. The sequencer reads each SNP with a probability of error. As will be seen later, to lower the effect of the reading error caused by the sequencer, we should increase the coverage depth. The set of reads sent to the data collector by the sequencer is denoted by $\mathcal{R}$.

Sequencers have errors in reading bases. The probability of error in reading a SNP in a fragment is assumed to be constant across all sequences and for all SNPs and is denoted by $\eta \in (0,1)$. More precisely, in the sequencer, for a fragment of an individual and in a SNP, the probability that a 1 is read as 0 or vice versa is $\eta$, independent of the individual, the fragment, and the SNP.

Having $Y$ as side information, the data collector maps $\mathcal{R}$ to the matrix $\hat{X} \in \{0,1\}^{M \times N}$ using a function $\phi$, i.e.

$$\hat{X} = \phi(Y, \mathcal{R}),$$

where $\hat{X}$ refers to an estimate of the matrix of SNPs for unknown individuals ($X$).

The proposed scheme should be such that the following two conditions are satisfied:

• Reconstruction Condition: Let $x_n$ and $\hat{x}_n$ denote column $n$ of the matrices $X$ and $\hat{X}$, respectively. The reconstruction condition requires that the inequality below hold for any given $\epsilon \in (0,1)$:

$$P(\hat{x}_n \neq x_n) \le \epsilon, \quad \forall n \in [N]. \tag{1}$$

$\epsilon$ is referred to as the accuracy level and is a design parameter.

• Privacy Condition: For privacy to hold, we want the distribution of $X_{m,n}$, $m \in [0:M-1]$, $n \in [N]$, to remain almost the same before and after reading the fragments. To be precise, the privacy condition requires that the following inequality hold for any given $\beta \in (0,1)$:

$$\frac{I\left(X_{m,n}, m \in [0:M-1], n \in [N]; \mathcal{R}\right)}{MN} \le \beta. \tag{2}$$

$\beta$ is referred to as the privacy level and is a design parameter.

In the following section we will introduce a proposed scheme that satisfies the two conditions simultaneously.

[Figure 1: block diagram with the sequencer (Seq.), the data collector (D. C.), unknown individuals U. Ind. 1, ..., U. Ind. M and known individuals K. Ind. 1, ..., K. Ind. K.]
Fig. 1. The block diagram of the proposed scheme in stage 1. First, the fragments of some individuals (known and unknown) are collected by the data collector, then they are pooled and sent to the sequencers.

[Figure 2: block diagram with the sequencer (Seq.), the data collector (D. C.) holding the sequence information of known individuals, and unknown individuals U. Ind. 1, U. Ind. 2, ..., U. Ind. M.]
Fig. 2. In stage 2, each sequencer sends the results of the reads to the data collector and then, using the information of the known individuals, the data collector processes the data and assembles the sequences of the unknown individuals.

III. STRUCTURED ACHIEVABLE SCHEME WITH CONSTANT COVERAGE DEPTH

In this section, we propose a scheme to satisfy both the reconstruction (1) and privacy (2) constraints. We have two assumptions in our scheme:

• Assumption 1: Every fragment is short enough to contain no more than one SNP.
• Assumption 2: Every fragment is long enough that it can be correctly mapped to the reference genome, i.e. we can identify exactly what region of the genome sequence it came from.

These two assumptions are realistic. We should keep in mind that there are approximately 3.3 million SNPs in the human genome. Compared to the 3 billion length of the whole genome, it is concluded that the average distance between two SNPs is roughly 1000 base pairs [34]. Moreover, using short read alignment algorithms like Bowtie [35], it is possible to align reads of length on the order of a couple of hundred. Thus, using such algorithms and choosing the fragment lengths to be about a few hundred, both assumptions are valid simultaneously.

In the proposed achievable scheme, we focus on the case where $S = M$. In cases where $M$ is greater than $S$, we partition the set of individuals into groups of size $S$ and use this scheme for each group separately. In this paper, we propose a specific assignment scheme for the coverage depth parameters. In the proposed solution, named the structured scheme, for $\forall m \in [0:M-1]$, $\forall k \in [0:K-1]$, and $\forall n \in [N]$ we have

$$\alpha_{m,n} = 2^m \alpha_0, \tag{3}$$

and

$$\tilde{\alpha}_{k,n} = 2^k \alpha_0, \tag{4}$$

where $\alpha_0 \in \mathbb{N}$. Also, entries in $X$ have prior probabilities following the major allele frequencies and entries in $Y$ have uniform prior probabilities.

Keeping the coverage depth variables exactly as introduced in the above equations is practically impossible. They are actually random variables. Analyzing the random case is rather complicated. To have a better understanding of the problem and make the analysis tractable, in this section we consider the constant case, and later in Section IV we generalize the results to the case of random coverage depths.

First, we introduce the main results. Then we derive the mathematical models in the data collector and the sequencers in Subsections III-A and III-B, respectively. We rely on these models to prove the main results in Subsections III-C and III-D. At last, we discuss the results in Subsection III-E.

The following theorem provides a sufficient condition for the reconstruction condition to hold.

Theorem 1. In the structured scheme with constant coverage depth and reading probability of error of $\eta$, the reconstruction condition (1) is satisfied if

$$\frac{\alpha_0}{2^{M+1}-2} \ge \frac{8\eta(1-\eta)}{(1-2\eta)^2} \ln\frac{1}{\epsilon}. \tag{5}$$
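To see why the power-of-two depths in (3) and (4) let the data collector separate the individuals, the following minimal Python sketch works through the noise-free case only ($\eta = 0$), assuming $K = M$. The function names and the example allele vectors are hypothetical and chosen purely for illustration; this is not the estimator analyzed in the paper, which also has to cope with reading errors.

```python
def pooled_major_count(x, y, alpha0):
    """Noise-free number of major-allele reads at one SNP position.

    Individual m (unknown bit x[m], known bit y[m]) contributes
    2**m * alpha0 fragments, following (3) and (4) with K = M.
    """
    return sum((2 ** m) * alpha0 * (xm + ym)
               for m, (xm, ym) in enumerate(zip(x, y)))


def decode_unknowns(count, y, alpha0):
    """Data-collector side: strip the known share, then read the bits."""
    scaled = count // alpha0                           # sum_m 2^m (x_m + y_m)
    known = sum((2 ** m) * ym for m, ym in enumerate(y))
    residual = scaled - known                          # sum_m 2^m x_m
    return [(residual >> m) & 1 for m in range(len(y))]


x = [1, 0, 1, 1]   # hypothetical alleles of the unknown individuals
y = [0, 1, 1, 0]   # hypothetical alleles of the known individuals
count = pooled_major_count(x, y, alpha0=5)
recovered = decode_unknowns(count, y, alpha0=5)
```

With the side information `y`, the binary expansion of the residual returns `x` exactly. Without it, the count is ambiguous: for instance, `x = [1, 0], y = [0, 1]` and `x = [0, 1], y = [1, 0]` produce the same pooled count, which is the kind of confusion the known individuals are meant to create at the sequencer.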

The following theorem provides a sufficient condition for the privacy condition to hold.

Theorem 2. In the structured scheme with constant coverage depth, the privacy condition (2) is satisfied if

$$M \ge \frac{1}{\beta}. \tag{6}$$

The main message of these results is that we can choose the parameters of the proposed scheme such that both conditions are satisfied simultaneously. In other words, these theorems confirm that separating the reading phase from the processing phase, together with adding known individuals and adjusting coverage depths, offers enough flexibility to satisfy both conditions at the same time: based on (6), $M$ is chosen, and using (5), $\alpha_0$ is set.

Example 1. If we assume the values $\eta = \epsilon = \beta = 0.01$, then based on Theorem 2 we can have $M \ge 100$, and based on Theorem 1 for $M = 100$, $\alpha_0 \simeq 9.6 \times 10^{29}$ (or greater). Also, for $\eta = \epsilon = \beta = 0.1$, we get $M = 10$ and $\alpha_0 \simeq 5300$. For another example, if we assume the values $\eta = 0.01$, $\epsilon = 0.001$, and $\beta = 0.1$, we will have $M = 10$ and $\alpha_0 \simeq 1166$. ♦

A. Mathematical Model in Data Collector in the Structured Scheme

For any SNP position $n \in [N]$, the data collector should be able to estimate the vector $x_n = [X_{0,n}, X_{1,n}, \cdots, X_{M-1,n}]$. In this subsection, we seek the model that the data collector observes in SNP position $n$. We will show that the data collector receives $G_n$ as

$$G_n = \sum_{m=0}^{M-1} 2^m X_{m,n} + Z_n, \tag{7}$$

in which $Z_n \sim \mathcal{N}(0, \sigma^2)$, where

$$\sigma^2 \triangleq \frac{\left(2^{M+1}-2\right)\eta(1-\eta)}{\alpha_0 (1-2\eta)^2}. \tag{8}$$

To obtain this model, we should keep in mind that the fragments have no tags, and neither the data collector nor the sequencer knows which individual a fragment belongs to. Therefore, when the data collector receives the read fragments from the sequencer, the only information it gets is the number of major (or minor) alleles in every position $n \in [1:N]$. Consequently, the data collector receives the following summation

$$\sum_{m=0}^{M-1} \sum_{i=1}^{2^m \alpha_0} \left( \tilde{X}_{m,n,i} + \tilde{Y}_{m,n,i} \right), \tag{9}$$

in which $\tilde{X}_{m,n,i}$ and $\tilde{Y}_{m,n,i}$ are noisy versions of $X_{m,n}$ and $Y_{m,n}$, respectively, due to the reading error caused by the sequencer. Also, recall that the data collector knows the sequence of known individuals a priori, i.e. it knows the value of all $Y_{m,n}$. Let us assume these values are $Y_{m,n} = y_{m,n}$. Therefore we have

$$\tilde{X}_{m,n,i} = \begin{cases} X_{m,n}, & \text{w.p. } 1-\eta \\ 1 - X_{m,n}, & \text{w.p. } \eta, \end{cases} \tag{10}$$

$$\tilde{Y}_{m,n,i} = \begin{cases} y_{m,n}, & \text{w.p. } 1-\eta \\ 1 - y_{m,n}, & \text{w.p. } \eta. \end{cases} \tag{11}$$

Note that the index $i$ refers to the read number. After scaling (9) and subtracting $\sum_{k=0}^{M-1} 2^k y_{k,n}$ and $\left(2^{M+1}-2\right)\frac{\eta}{1-2\eta}$, (9) can be written as

$$G_n = \frac{1}{\alpha_0(1-2\eta)} \left( \sum_{m=0}^{M-1} \sum_{i=1}^{2^m\alpha_0} \left( \tilde{X}_{m,n,i} + \tilde{Y}_{m,n,i} \right) \right) - \sum_{k=0}^{M-1} 2^k y_{k,n} - \left(2^{M+1}-2\right)\frac{\eta}{1-2\eta}. \tag{12}$$

Note that subtracting $\sum_{k=0}^{M-1} 2^k y_{k,n}$ in the above equation is fine, because full knowledge of the matrix $Y$ is available at the data collector.

To follow, we derive the parameters of the random variable $\tilde{X}_{m,n,i}$ conditioned on knowing $X_{m,n}$. Based on (10) we have

$$E\left[\tilde{X}_{m,n,i} \mid X_{m,n}\right] = X_{m,n}(1-\eta) + (1-X_{m,n})\eta = (1-2\eta)X_{m,n} + \eta, \tag{13}$$

$$\begin{aligned} \mathrm{Var}\left(\tilde{X}_{m,n,i} \mid X_{m,n}\right) &= E\left[\tilde{X}_{m,n,i}^2 \mid X_{m,n}\right] - \left(E\left[\tilde{X}_{m,n,i} \mid X_{m,n}\right]\right)^2 \\ &= X_{m,n}^2(1-\eta) + (1-X_{m,n})^2\eta - \left((1-2\eta)X_{m,n}+\eta\right)^2 \\ &= \eta(1-\eta), \end{aligned} \tag{14}$$

in which the last equality is valid for both possible values of $X_{m,n}$, i.e. 0 and 1. Using the MMSE estimate and the orthogonality principle, we can write

$$\tilde{X}_{m,n,i} = (1-2\eta)X_{m,n} + \eta + Z_{m,n,i}, \tag{15}$$

where $Z_{m,n,i}$ is a random variable with $E(Z_{m,n,i}) = 0$ and $\mathrm{Var}(Z_{m,n,i}) = \eta(1-\eta)$. Also, $Z_{m,n,i}$ and $X_{m,n}$ are uncorrelated. Consequently

$$\frac{1}{\alpha_0(1-2\eta)} \sum_{i=1}^{2^m\alpha_0} \tilde{X}_{m,n,i} = 2^m X_{m,n} + \frac{2^m\eta}{1-2\eta} + \frac{\sum_{i=1}^{2^m\alpha_0} Z_{m,n,i}}{\alpha_0(1-2\eta)}. \tag{16}$$

Based on the central limit theorem, $\frac{\sum_{i=1}^{\alpha_0} Z_{m,n,i}}{\sqrt{\alpha_0}}$ converges in distribution to a normal distribution with zero mean and variance $\eta(1-\eta)$. Thus

$$\frac{\sum_{i=1}^{\alpha_0} Z_{m,n,i}}{\alpha_0(1-2\eta)} = \frac{1}{\sqrt{\alpha_0}(1-2\eta)} \cdot \frac{\sum_{i=1}^{\alpha_0} Z_{m,n,i}}{\sqrt{\alpha_0}} \tag{17}$$

converges in distribution to a normal distribution with zero mean and variance $\frac{\eta(1-\eta)}{\alpha_0(1-2\eta)^2}$. Thus, the last term on the right-hand side of (16), which sums $2^m\alpha_0$ such terms, converges to a normal distribution with zero mean and variance $\frac{2^m\eta(1-\eta)}{\alpha_0(1-2\eta)^2}$. Similarly, using (11), we reach a similar equation for the known individuals. Summing over $m$, the two constant offsets add up to $\left(2^{M+1}-2\right)\frac{\eta}{1-2\eta}$ and the noise variances add up to $\sigma^2$ in (8). Consequently, using (16), we can rewrite (12) as (7).

B. Mathematical Model in Sequencer in Structured Scheme

Similar to the previous subsection, the sequencer receives the summation in (9). The difference from the previous subsection is that all individuals are unknown from the sequencer's viewpoint. Therefore,

$$\tilde{Y}_{k,n,i} = \begin{cases} Y_{k,n}, & \text{w.p. } 1-\eta \\ 1 - Y_{k,n}, & \text{w.p. } \eta. \end{cases} \tag{18}$$

Yet, $\tilde{X}_{m,n,i}$ follows (10).

Scaling the summation in (9) and removing the constant offset, the sequencer receives $q_n$ defined as

$$q_n \triangleq \frac{1}{\alpha_0(1-2\eta)} \left( \sum_{m=0}^{M-1} \sum_{i=1}^{2^m\alpha_0} \left( \tilde{X}_{m,n,i} + \tilde{Y}_{m,n,i} \right) \right) - \left(2^{M+1}-2\right)\frac{\eta}{1-2\eta}. \tag{19}$$

Taking similar steps as in the previous subsection, $q_n$ is written as

$$q_n = \sum_{m=0}^{M-1} 2^m \left( X_{m,n} + Y_{m,n} \right) + \tilde{Z}_n, \tag{20}$$

where $\tilde{Z}_n \sim \mathcal{N}(0, \sigma^2)$, in which $\sigma^2$ is defined in (8).

C. Proof of Theorem 1

Having reached the mathematical model in the data collector in (7), we provide the proof of Theorem 1.

Proof. Note that the value of the summation $\sum_{m=0}^{M-1} 2^m X_{m,n}$ uniquely matches an $x_n$ (in its binary representation, each digit corresponds to an $X_{m,n}$ for a different value of $m$). Therefore, our objective is to find the summation above. The probability of error in estimating the summation, based on (7), is simply upper bounded by

$$P(\text{error}) \le Q\left(\frac{d_{\min}}{2\sigma}\right). \tag{21}$$

Obviously, here $d_{\min} = 1$ due to the fact that the $X_{m,n}$ are chosen from the set $\{0,1\}$. Thus

$$P(\text{error}) \le Q\left(\frac{1}{2\sigma}\right) \le \exp\left(\frac{-1}{8\sigma^2}\right), \tag{22}$$

in which $\sigma^2$ is defined in (8). Requiring the right-hand side of (22) to be at most $\epsilon$ and substituting (8) yields (5).

D. Proof of Theorem 2

Using the mathematical model in (20), we are ready to provide the proof of Theorem 2.

Proof. For the sequencer, $\mathcal{R}$ is equivalent to $q_n$, $\forall n \in [N]$, because fragments contain just one SNP and are grouped based on the SNP position they contain, and in the group containing SNP position $n$ the information is stored in $q_n$. Thus we have

$$P(X \mid \mathcal{R}) = \prod_{n=1}^{N} P(x_n \mid q_n). \tag{23}$$

Recall that $x_n$, $\forall n \in [N]$, denotes column $n$ of $X$. Due to the independence of the entries in $X$, we have

$$P(X) = \prod_{n=1}^{N} P(x_n). \tag{24}$$

Based on the last two equalities,

$$I(X; \mathcal{R}) = \sum_{n=1}^{N} I(x_n; q_n). \tag{25}$$

Thus, for the privacy condition (2) to be satisfied, it is sufficient for every $n \in [N]$ to have

$$\frac{I(x_n; q_n)}{M} \le \beta. \tag{26}$$

To begin, we define $Z_n$ as

$$Z_n \triangleq \sum_{m=0}^{M-1} 2^m \left( X_{m,n} + Y_{m,n} \right). \tag{27}$$

It is concluded that the following Markov chain holds,

$$x_n \rightarrow Z_n \rightarrow q_n. \tag{28}$$

Thus we have

$$I(x_n; q_n) \le I(x_n; Z_n). \tag{29}$$

In what follows, we seek $I(x_n; Z_n)$. We have

$$I(X_{0,n}, \cdots, X_{M-1,n}; Z_n) = H(Z_n) - H(Z_n \mid X_{0,n}, \cdots, X_{M-1,n}). \tag{30}$$

We expand $Z_n$ in binary form

$$Z_n = \left(B_{M,n}\, b_{M-1,n} \cdots b_{0,n}\right)_2. \tag{31}$$

Consequently, the following equations hold

$$X_{0,n} + Y_{0,n} = 2B_{1,n} + b_{0,n}, \tag{32}$$
$$X_{1,n} + Y_{1,n} + B_{1,n} = 2B_{2,n} + b_{1,n}, \tag{33}$$
$$\vdots$$
$$X_{M-1,n} + Y_{M-1,n} + B_{M-1,n} = 2B_{M,n} + b_{M-1,n}, \tag{34}$$

in which, in equation $i$, $B_{i+1,n}$ is the carry of the left-hand summation. Equivalently, in the binary field we have

$$b_{0,n} = X_{0,n} \oplus Y_{0,n}, \tag{35}$$
$$b_{1,n} = X_{1,n} \oplus Y_{1,n} \oplus B_{1,n}, \tag{36}$$
$$\vdots$$
$$b_{M-1,n} = X_{M-1,n} \oplus Y_{M-1,n} \oplus B_{M-1,n}. \tag{37}$$

(31) yields

$$H(Z_n) = H\left(B_{M,n}\, b_{M-1,n} \cdots b_{0,n}\right). \tag{38}$$

We expand the right-hand side of the above equality as

$$H\left(B_{M,n}\, b_{M-1,n} \cdots b_{0,n}\right) = H(b_{0,n}) + H(b_{1,n} \mid b_{0,n}) + \cdots + H\left(B_{M,n} \mid b_{M-1,n} \cdots b_{0,n}\right). \tag{39}$$

Based on (35) and the fact that the entries of $Y$ have uniform prior probabilities, $b_{0,n}$ has a uniform distribution, so $H(b_{0,n}) = 1$. For $H(b_{1,n} \mid b_{0,n})$ we have

$$H(b_{1,n} \mid b_{0,n}) \ge H(b_{1,n} \mid b_{0,n}, B_{1,n}) \overset{(a)}{=} H(b_{1,n} \mid B_{1,n}) \overset{(b)}{=} H(X_{1,n} \oplus Y_{1,n}), \tag{40}$$

which equals 1 because $Y_{1,n}$ is uniform, and which also results in $H(b_{1,n} \mid b_{0,n}) = 1$. Note that (a) results from the fact that $B_{1,n}$ is a sufficient statistic for $b_{1,n}$. Also, (b) results from (36). Similarly, all the terms in (39) equal 1 except the last term. Therefore,

$$H(Z_n) = H\left(B_{M,n}\, b_{M-1,n} \cdots b_{0,n}\right) = M + H\left(B_{M,n} \mid b_{M-1,n} \cdots b_{0,n}\right). \tag{41}$$

Based on (27), for the second term on the right-hand side of (30) we have

$$H(Z_n \mid X_{0,n}, \cdots, X_{M-1,n}) = H\left(\sum_{m=0}^{M-1} 2^m Y_{m,n}\right) = \sum_{m=0}^{M-1} H(Y_{m,n}) = \sum_{m=0}^{M-1} 1 = M. \tag{42}$$

Using the last two equalities and (30), we have

$$I(x_n; Z_n) = H\left(B_{M,n} \mid b_{M-1,n} \cdots b_{0,n}\right) \le 1. \tag{43}$$

Together with (29), this gives $I(x_n; q_n)/M \le 1/M$, so (26) holds whenever $M \ge 1/\beta$. The proof is complete.

[Figure 3: plot of $I(x_n; Z_n) = H\left(B_{M,n} \mid b_{M-1,n} \cdots b_{0,n}\right)$ versus $M$ for $M = 1, \ldots, 8$, major allele frequency 0.7; vertical axis from 0.4 to 0.65.]
Fig. 3. As seen in this figure, $I(x_n; Z_n)$ is an increasing function of $M$.

E. Discussion

As seen from Theorem 1, the minimum $\alpha_0$ needed to preserve the reconstruction condition grows exponentially with $M$. $\alpha_0$ is a noise-resistance parameter: as it becomes larger, the ratio of the fragments containing false reads concentrates around the probability of error in the reading phase ($\eta$); that is why increasing $\alpha_0$ helps to eliminate the noise term in (7).

Taking a deeper look at the procedure in the proof of Theorem 2, we realize that we have created binary field addition in our scheme, as was desired. The bits $b_{i,n}$ derived from (35) to (37) are the result of binary field addition. The addition is of two random variables where one of them, $Y_{i,n}$, has a uniform distribution, and the other, $X_{i,n}$, follows the distribution of SNP position $n$. If the value of $b_{i,n}$ is given alone, the result reveals no new information about $X_{i,n}$. Thus these bits alone are not leaking any information. So we have created this kind of addition, thanks to adjusting the coverage depth values. From (43) it is concluded that the only bit leaking information in position $n$ is $B_{M,n}$, which means the binary field addition scheme is not working perfectly, but we should remember that the problem addressed in this paper has its limitations that we should adapt to. Interestingly, the maximum entropy of this bit is 1, and this upper bound on the information leakage is independent of $M$. This aspect is very useful and results in the average information leakage per bit being at most $\frac{1}{M}$. Therefore, by increasing $M$ this average decreases, so we can adjust $M$ to reach the desired $\beta$. Note that based on our simulations, $I(x_n; Z_n) = H\left(B_{M,n} \mid b_{M-1,n} \cdots b_{0,n}\right)$ is an increasing function of $M$ (see Fig. 3) and tends to an ultimate value. So by increasing $M$, the information leakage per bit decreases with the rate of $\frac{1}{M}$, not more.

IV. STRUCTURED ACHIEVABLE SCHEME WITH RANDOM COVERAGE DEPTH

In the previous section, we analyzed the problem for constant coverage depth; however, it is not a practical case because we do not have exact control over the number of fragments. In this section, we consider a more general case in which the coverage depth parameters are random variables. We assume them to be binomial variables and approximate them with the normal distribution. Therefore, for $\forall n \in [N]$, $\forall m \in [0:M-1]$, we have

$$\alpha_{m,n} \sim \mathcal{N}\left(2^m\alpha_0,\ 2^m\sigma_\alpha^2\right). \tag{44}$$

Similarly, for $\forall n \in [N]$, $\forall k \in [0:K-1]$, we have

$$\tilde{\alpha}_{k,n} \sim \mathcal{N}\left(2^k\alpha_0,\ 2^k\sigma_\alpha^2\right). \tag{45}$$

Due to the fact that coverage depths mostly have large values, we assumed that $\alpha_0 \in \mathbb{N}$.

As in the previous section, we first introduce the results. After that, the mathematical model and the estimation rule are introduced in Subsections IV-A and IV-B. Then, the proof of Theorem 3 is provided in Subsection IV-C. Following them, we discuss the results in Subsection IV-D.

The following theorem provides a sufficient condition to satisfy the reconstruction condition.

Theorem 3. In the structured scheme with random coverage depth, the reconstruction condition (1) is satisfied if:

$$\alpha_0 \ge \max\{e_1, e_2\}, \tag{46}$$

where

$$e_1 \triangleq \frac{16\eta(1-\eta)}{(1-2\eta)^2}\left(2^{M+1}-2\right)\ln\frac{1}{\epsilon}, \tag{47}$$

and

$$e_2 \triangleq \sqrt{16\sigma_\alpha^2\left(1 + \frac{\eta^2}{(1-2\eta)^2}\right)\left(2^{M+1}-2\right)\ln\frac{1}{\epsilon}}. \tag{48}$$
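As a rough numerical illustration of Theorem 3, the bounds (47) and (48) can be evaluated directly. The following Python sketch is only a transcription of those two formulas; the function names and the sample parameter values are illustrative assumptions, not values taken from the paper.

```python
import math


def e1(eta, eps, M):
    """Right-hand side of (47)."""
    return (16 * eta * (1 - eta) / (1 - 2 * eta) ** 2
            * (2 ** (M + 1) - 2) * math.log(1 / eps))


def e2(eta, eps, M, sigma_alpha):
    """Right-hand side of (48)."""
    return math.sqrt(16 * sigma_alpha ** 2
                     * (1 + eta ** 2 / (1 - 2 * eta) ** 2)
                     * (2 ** (M + 1) - 2) * math.log(1 / eps))


def min_alpha0_random(eta, eps, M, sigma_alpha):
    """Smallest alpha_0 allowed by the sufficient condition (46)."""
    return max(e1(eta, eps, M), e2(eta, eps, M, sigma_alpha))


# Hypothetical operating point: eta = eps = 0.1, M = 10.
no_spread = min_alpha0_random(0.1, 0.1, 10, sigma_alpha=0.0)
with_spread = min_alpha0_random(0.1, 0.1, 10, sigma_alpha=100.0)
```

With `sigma_alpha = 0` the bound reduces to $e_1$, which has the same form as the right-hand side of (5) with the constant 16 in place of 8. For a large enough `sigma_alpha` (here the assumed value 100), $e_2$ becomes the binding term.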

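Remark 1: For the privacy condition, Theorem 2 is valid here as well. This will be discussed later in Subsection IV-D.

A. Mathematical Model in Data Collector in the Structured Scheme

In this subsection, we will show that the information the data collector receives is the value $G_n$, which is written as

$$G_n = \sum_{m=0}^{M-1} \left(2^m + \delta_{m,n}\right) X_{m,n} + \sum_{k=0}^{M-1} \left(2^k + \tilde{\delta}_{k,n}\right) y_{k,n} + Z_n, \tag{49}$$

where $\delta_{m,n}$ and $\tilde{\delta}_{k,n}$ are normal random variables with zero mean and variance $2^m\sigma_1^2$ and $2^k\sigma_1^2$, respectively, where

$$\sigma_1^2 \triangleq \frac{\sigma_\alpha^2}{\alpha_0^2}. \tag{50}$$

Also, $Z_n \sim \mathcal{N}(0, \sigma^2)$, where

$$\sigma^2 \triangleq \frac{2^{M+1}-2}{(1-2\eta)^2}\left(\frac{\eta(1-\eta)}{\alpha_0} + \eta^2\sigma_1^2\right). \tag{51}$$

In the pooled sequencing scenario, the data collector will receive $G_n$, which is defined as

$$G_n = \frac{1}{\alpha_0(1-2\eta)}\left(\sum_{m=0}^{M-1}\sum_{i=1}^{\alpha_{m,n}} \tilde{X}_{m,n,i}\right) + \frac{1}{\alpha_0(1-2\eta)}\sum_{k=0}^{M-1}\sum_{i=1}^{\tilde{\alpha}_{k,n}} \tilde{Y}_{k,n,i} - \left(2^{M+1}-2\right)\frac{\eta}{1-2\eta}. \tag{52}$$

Consider the random variable $\sum_{i=1}^{\alpha_{m,n}} \tilde{X}_{m,n,i}$ conditioned on $X_{m,n}$. We have

$$\sum_{i=1}^{\alpha_{m,n}} \tilde{X}_{m,n,i} = \sum_{i=1}^{2^m\alpha_0} \tilde{X}_{m,n,i} + \sum_{i=2^m\alpha_0+1}^{\alpha_{m,n}} \tilde{X}_{m,n,i}. \tag{53}$$

It is trivial that the random variables $\sum_{i=1}^{2^m\alpha_0} \tilde{X}_{m,n,i}$ and $\sum_{i=2^m\alpha_0+1}^{\alpha_{m,n}} \tilde{X}_{m,n,i}$ are independent conditioned on $X_{m,n}$. Also, the distribution of $\sum_{i=2^m\alpha_0+1}^{\alpha_{m,n}} \tilde{X}_{m,n,i}$ resembles that of $\sum_{i=1}^{\alpha_{m,n}-2^m\alpha_0} \tilde{X}_{m,n,i}$, both conditioned on $X_{m,n}$.

We define

$$\xi_{m,n} \triangleq \alpha_{m,n} - 2^m\alpha_0. \tag{54}$$

Thus we have $E(\xi_{m,n}) = 0$ and

$$\mathrm{Var}(\xi_{m,n}) = \mathrm{Var}(\alpha_{m,n}) = 2^m\sigma_\alpha^2. \tag{55}$$

Similar to the steps taken in Subsection III-A, and as a result of the central limit theorem and the orthogonality principle,

$$\frac{1}{\alpha_0}\sum_{i=1}^{2^m\alpha_0} \tilde{X}_{m,n,i} \text{ conditioned on } X_{m,n} \sim \mathcal{N}\left(2^m\left((1-2\eta)X_{m,n}+\eta\right),\ \frac{2^m\eta(1-\eta)}{\alpha_0}\right). \tag{56}$$

For the second term in (53) we have

$$\begin{aligned} E\left[\frac{1}{\alpha_0}\sum_{i=1}^{\xi_{m,n}} \tilde{X}_{m,n,i} \,\Big|\, X_{m,n}\right] &= E_{\xi_{m,n}}\left[E\left[\frac{1}{\alpha_0}\sum_{i=1}^{\xi_{m,n}} \tilde{X}_{m,n,i} \,\Big|\, X_{m,n}, \xi_{m,n}\right]\right] \\ &= \frac{1}{\alpha_0} E_{\xi_{m,n}}\left[\xi_{m,n}\left((1-2\eta)X_{m,n}+\eta\right)\right] = 0. \end{aligned} \tag{57}$$

Using the law of total variance we have

$$\begin{aligned} \mathrm{Var}\left(\frac{1}{\alpha_0}\sum_{i=1}^{\xi_{m,n}} \tilde{X}_{m,n,i} \,\Big|\, X_{m,n}\right) &= E_{\xi_{m,n}}\left[\mathrm{Var}\left(\frac{1}{\alpha_0}\sum_{i=1}^{\xi_{m,n}} \tilde{X}_{m,n,i} \,\Big|\, X_{m,n}, \xi_{m,n}\right)\right] \\ &\quad + \mathrm{Var}_{\xi_{m,n}}\left(E\left[\frac{1}{\alpha_0}\sum_{i=1}^{\xi_{m,n}} \tilde{X}_{m,n,i} \,\Big|\, X_{m,n}, \xi_{m,n}\right]\right) \\ &\overset{(a)}{=} \mathrm{Var}_{\xi_{m,n}}\left(\frac{1}{\alpha_0}\xi_{m,n}\left((1-2\eta)X_{m,n}+\eta\right)\right) \\ &= \frac{\left((1-2\eta)X_{m,n}+\eta\right)^2}{\alpha_0^2}\,\mathrm{Var}_{\xi_{m,n}}(\xi_{m,n}) \\ &= \frac{\left((1-2\eta)X_{m,n}+\eta\right)^2}{\alpha_0^2}\, 2^m\sigma_\alpha^2, \end{aligned} \tag{58}$$

where (a) results from the fact that $E(\xi_{m,n}) = 0$. Based on (58), (57), (56), (53), and due to the MMSE rule and the orthogonality principle, we have

$$\frac{1}{\alpha_0(1-2\eta)}\sum_{i=1}^{\alpha_{m,n}} \tilde{X}_{m,n,i} = \frac{\alpha_{m,n}}{\alpha_0}\left(X_{m,n} + \frac{\eta}{1-2\eta}\right) + Z_{m,n}, \tag{59}$$

where $Z_{m,n} \sim \mathcal{N}\left(0, \frac{2^m\eta(1-\eta)}{\alpha_0(1-2\eta)^2}\right)$. Using the same steps, for the known individuals at the data collector we have

$$\frac{1}{\alpha_0(1-2\eta)}\sum_{i=1}^{\tilde{\alpha}_{k,n}} \tilde{Y}_{k,n,i} = \frac{\tilde{\alpha}_{k,n}}{\alpha_0}\left(y_{k,n} + \frac{\eta}{1-2\eta}\right) + \tilde{Z}_{k,n}, \tag{60}$$

where $\tilde{Z}_{k,n} \sim \mathcal{N}\left(0, \frac{2^k\eta(1-\eta)}{\alpha_0(1-2\eta)^2}\right)$.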

Therefore, using (59) and (60), (52) can be written as

$$G_n = \sum_{m=0}^{M-1} \frac{\alpha_{m,n}}{\alpha_0}\left(X_{m,n} + \frac{\eta}{1-2\eta}\right) + \sum_{k=0}^{M-1} \frac{\tilde{\alpha}_{k,n}}{\alpha_0}\left(y_{k,n} + \frac{\eta}{1-2\eta}\right) - \left(2^{M+1}-2\right)\frac{\eta}{1-2\eta} + Z'_n, \tag{61}$$

where

$$Z'_n \triangleq \sum_{m=0}^{M-1}\left(Z_{m,n} + \tilde{Z}_{m,n}\right). \tag{62}$$

Thus

$$Z'_n \sim \mathcal{N}\left(0,\ \frac{\left(2^{M+1}-2\right)\eta(1-\eta)}{\alpha_0(1-2\eta)^2}\right). \tag{63}$$

The fraction $\frac{\alpha_{m,n}}{\alpha_0}$ can be written as

$$\frac{\alpha_{m,n}}{\alpha_0} = 2^m + \frac{\zeta_{m,n}}{\alpha_0}, \tag{64}$$

where

$$\zeta_{m,n} \triangleq \alpha_{m,n} - 2^m\alpha_0. \tag{65}$$

Therefore $\mathrm{Var}(\zeta_{m,n}) = \mathrm{Var}(\alpha_{m,n})$, and for $\delta_{m,n} \triangleq \frac{\zeta_{m,n}}{\alpha_0}$ we have

$$\mathrm{Var}(\delta_{m,n}) = \frac{\mathrm{Var}(\alpha_{m,n})}{\alpha_0^2} = 2^m\sigma_1^2. \tag{66}$$

Also, $\tilde{\delta}_{k,n}$ is defined similarly. Using (61), (64), and (66), (49) results from (52).

B. Estimation Rule

For any SNP position $n \in [N]$, the objective for the data collector is to estimate the vector $x_n = [X_{0,n}, X_{1,n}, \cdots, X_{M-1,n}]^T$. We define the extended vector $\tilde{x}_n \triangleq [X_{0,n}, \cdots, X_{M-1,n}, y_{0,n}, \cdots, y_{K-1,n}]^T$, where the last $K$ entries are known to the data collector. Therefore, for the data collector, estimating $\tilde{x}_n$ is equivalent to estimating $x_n$. In this subsection, our objective is to find the rule that should be used by the data collector to estimate $\tilde{x}_n$. Using the ML rule, the estimate $\hat{\tilde{x}}_n$ is obtained by

$$\hat{\tilde{x}}_n = \arg\max_{\tilde{x}_n} P(G_n \mid \tilde{x}_n) = \arg\max_{\tilde{x}_n} P\left(G_n - \sum_{m=0}^{M-1} 2^m\left(X_{m,n} + y_{m,n}\right) \,\Big|\, \tilde{x}_n\right). \tag{67}$$

Let

$$V_n \triangleq G_n - \sum_{m=0}^{M-1} 2^m\left(X_{m,n} + y_{m,n}\right). \tag{68}$$

Based on (49),

$$V_n \text{ conditioned on } \tilde{x}_n \sim \mathcal{N}\left(0,\ \left(2^{M+1}-2\right)\sigma_1^2 + \sigma^2\right). \tag{69}$$

Therefore,

$$\hat{\tilde{x}}_n = \arg\min_{\tilde{x}_n} |V_n|. \tag{70}$$

C. Proof of Theorem 3

Based on the mathematical model and estimation rule presented in the previous subsections, we are ready to provide the proof of Theorem 3.

Proof. Similar to the proof presented in Subsection III-C, and based on the estimation rule in (70), our estimation resembles an AWGN channel. In other words, if we estimate $\sum_{m=0}^{M-1} 2^m\left(X_{m,n} + y_{m,n}\right)$, then $\hat{\tilde{x}}_n$ results accordingly. Thus, for the probability of error we have

$$P(\text{error}) \le Q\left(\frac{1}{2\sqrt{\left(2^{M+1}-2\right)\sigma_1^2 + \sigma^2}}\right) \le \exp\left(\frac{-1}{8\left(\left(2^{M+1}-2\right)\sigma_1^2 + \sigma^2\right)}\right). \tag{71}$$

Requiring the right-hand side to be less than $\epsilon$ results in

$$\left(2^{M+1}-2\right)\sigma_1^2 + \sigma^2 \le \frac{1}{8\ln\frac{1}{\epsilon}}. \tag{72}$$

Rewriting the left-hand side by substituting $\sigma^2$ from (51) results in

$$\left(2^{M+1}-2\right)\sigma_1^2 + \sigma^2 = \frac{2^{M+1}-2}{(1-2\eta)^2}\left(\frac{\eta(1-\eta)}{\alpha_0} + \left(\eta^2 + (1-2\eta)^2\right)\sigma_1^2\right). \tag{73}$$

For (72) to hold, it is sufficient for each of the two terms on the right-hand side of the above equality to be less than $\frac{1}{16\ln\frac{1}{\epsilon}}$. From the first inequality we have

$$\alpha_0 \ge \frac{16\eta(1-\eta)}{(1-2\eta)^2}\left(2^{M+1}-2\right)\ln\frac{1}{\epsilon}. \tag{74}$$

From the second one, using $\sigma_1^2 = \sigma_\alpha^2/\alpha_0^2$ from (50), we reach

$$\alpha_0 \ge \sqrt{16\sigma_\alpha^2\left(1 + \frac{\eta^2}{(1-2\eta)^2}\right)\left(2^{M+1}-2\right)\ln\frac{1}{\epsilon}}. \tag{75}$$

As both inequalities above should hold, the theorem is proven.

D. Discussion

First of all, if we put $\sigma_\alpha = 0$ in Theorem 3, the result resembles that of Theorem 1, which was expected. Also, as seen from Theorem 3, by increasing $M$ and decreasing $\epsilon$, $e_1$ grows much faster (quadratically) than $e_2$. So for small enough $\sigma_\alpha$, $e_1$ is probably the bigger value.

Remark 2: In this remark we show that Theorem 2 works in the random coverage depth case too. Similar to the steps taken in Subsection IV-A, the sequencer will receive $q_n$ in SNP position $n \in [N]$ such that

$$q_n = \sum_{m=0}^{M-1}\left(2^m + \delta_{m,n}\right)X_{m,n} + \sum_{k=0}^{M-1}\left(2^k + \tilde{\delta}_{k,n}\right)Y_{k,n} + \tilde{Z}_n \tag{76}$$

where $\delta_{m,n}$ and $\tilde{\delta}_{k,n}$ are normal random variables with zero mean and variance $\sigma_1^2$. Also, $\tilde{Z}_n \sim \mathcal{N}(0, \sigma^2)$ where $\sigma^2$ is defined in (51). We can write $q_n$ as

$$q_n = \sum_{m=0}^{M-1}2^m\left(X_{m,n} + Y_{m,n}\right) + \hat{Z}_n, \tag{77}$$

where $\hat{Z}_n \sim \mathcal{N}\left(0, \left(2^{M+1}-2\right)\sigma_1^2 + \sigma^2\right)$.

From (77), the Markov chain in the proof of Theorem 2 is valid here as well. Continuing the same steps, we conclude that Theorem 2 holds here. Therefore, all the discussions in that scenario are still valid here.

Remark 3: All the results derived so far are for the case of haploid cells, in which there is one set of chromosomes. In the case of diploid cells, each cell carries two sets of chromosomes, so every position in the genome is covered by two chromosomes. To extend our results to the case of diploid cells, we can assume that each individual contains the chromosomes of two haploid-celled individuals. All the results therefore carry over to diploid cells if we replace $M$ with $2M$ in the $M$-individual scheme.

V. CONCLUSION AND FUTURE STEPS

In this paper, we introduced the problem of privacy in the process of DNA sequencing. Privacy has previously been studied for genomic data sets, but that notion of privacy is very different from ours. We seek privacy in the sequencing process itself: the DNA sequence is hidden from the sequencing machine, while it can still be reconstructed in a trusted local processor. Previous approaches were mainly concerned with how to prepare genomic data for release in a way that does not violate the privacy of any single individual, which shows how our approach differs.

In this paper, we aimed to define the problem of privacy in DNA sequencing theoretically and to introduce an achievable scheme that satisfies our constraints if its parameters are adjusted correctly. We used non-colluding sequencers and distributed the genome data between them. We also used the idea of pooled sequencing and combined the real data with known sequences. By setting the number of known sequences and the coverage depth of the sequences, we can satisfy the constraints.

As this is the first paper on this problem, much remains for future work. For instance, one could consider the case in which a set of sequencers collaborate, or the case in which fragments are not limited to containing just one SNP. Also, the lower bounds in the theorems of this paper can be improved. Finally, we hope this paper has paved the way towards privacy in the process of sequencing.

REFERENCES

[1] K. A. Phillips, J. R. Trosman, R. K. Kelley, M. J. Pletcher, M. P. Douglas, and C. B. Weldon, “Genomic sequencing: assessing the health care system, policy, and big-data implications,” Health affairs, vol. 33, no. 7, pp. 1246–1253, 2014.
[2] C. G. Van El, M. C. Cornel, P. Borry, R. J. Hastings, F. Fellmann, S. V. Hodgson, H. C. Howard, A. Cambon-Thomsen, B. M. Knoppers, H. Meijers-Heijboer, et al., “Whole-genome sequencing in health care,” European Journal of Human Genetics, vol. 21, no. 6, p. 580, 2013.
[3] E. S. of Human Genetics, helps diagnosis and reduces healthcare costs for newborns in intensive care, 2018. https://medicalxpress.com/news/2018-06-genome-sequencing-diagnosis-healthcare-newborns.html.
[4] S. D. Grosse and M. J. Khoury, “What is the clinical utility of genetic testing?,” Genetics in Medicine, vol. 8, no. 7, p. 448, 2006.
[5] A. S. of Clinical Oncology et al., “American society of clinical oncology policy statement update: genetic testing for cancer susceptibility,” Journal of clinical oncology: official journal of the American Society of Clinical Oncology, vol. 21, no. 12, p. 2397, 2003.
[6] M. R. Anderlik and M. A. Rothstein, “Privacy and confidentiality of genetic information: what rules for the new science?,” Annual review of genomics and human genetics, vol. 2, no. 1, pp. 401–433, 2001.
[7] M. White, Why you should be scared of someone stealing your genome, 2013. https://psmag.com/environment/why-you-should-be-scared-of-someone-stealing-your-genome-58082.
[8] C. Heeney, N. Hawkins, J. de Vries, P. Boddington, and J. Kaye, “Assessing the privacy risks of data sharing in genomics,” Public health genomics, vol. 14, no. 1, pp. 17–25, 2011.
[9] B. M. Knoppers, “Genetic information and the family: are we our brother’s keeper?,” Trends in biotechnology, vol. 20, no. 2, pp. 85–86, 2002.
[10] M. Parker and A. M. Lucassen, “Genetic information: a joint account?,” Bmj, vol. 329, no. 7458, pp. 165–167, 2004.
[11] J. Kaye, “The tension between data sharing and the protection of privacy in genomics research,” Annual review of genomics and human genetics, vol. 13, pp. 415–431, 2012.
[12] J. E. Lunshof, R. Chadwick, D. B. Vorhaus, and G. M. Church, “From genetic privacy to open consent,” Nature Reviews Genetics, vol. 9, no. 5, p. 406, 2008.
[13] B. Malin and L. Sweeney, “How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems,” Journal of biomedical informatics, vol. 37, no. 3, pp. 179–192, 2004.
[14] W. W. Lowrance and F. S. Collins, “Identifiability in genomic research,” Science, vol. 317, no. 5838, pp. 600–602, 2007.
[15] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[16] C. C. Aggarwal, “On k-anonymity and the curse of dimensionality,” in Proceedings of the 31st international conference on Very large data bases, pp. 901–909, VLDB Endowment, 2005.
[17] A. Johnson and V. Shmatikov, “Privacy-preserving data exploration in genome-wide association studies,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1079–1087, ACM, 2013.
[18] S. E. Fienberg, A. Slavkovic, and C. Uhler, “Privacy preserving gwas data sharing,” in Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pp. 628–635, IEEE, 2011.
[19] J. Kaye, C. Heeney, N. Hawkins, J. De Vries, and P. Boddington, “Data sharing in genomics–re-shaping scientific practice,” Nature Reviews Genetics, vol. 10, no. 5, p. 331, 2009.
[20] Y. Erlich and A. Narayanan, “Routes for breaching and protecting genetic privacy,” Nature Reviews Genetics, vol. 15, no. 6, pp. 409–421, 2014.
[21] M. Z. Hasan, M. S. R. Mahdi, M. N. Sadat, and N. Mohammed, “Secure count query on encrypted genomic data,” Journal of biomedical informatics, vol. 81, pp. 41–52, 2018.
[22] R. A. Miller, L. Dunlap, K. Park, J. Muller, and D. Shell, “System and method for privacy-preserving genomic data analysis,” Apr. 10 2018. US Patent 9,942,206.
[23] M. Kantarcioglu, W. Jiang, Y. Liu, and B. Malin, “A cryptographic approach to securely share and query genomic sequences,” IEEE Transactions on information technology in biomedicine, vol. 12, no. 5, pp. 606–617, 2008.
[24] J. Messing, R. Crea, and P. H. Seeburg, “A system for shotgun DNA sequencing,” Nucleic acids research, vol. 9, no. 2, pp. 309–321, 1981.
[25] J. C. Venter, M. D. Adams, G. G. Sutton, A. R. Kerlavage, H. O. Smith, and M. Hunkapiller, “Shotgun sequencing of the human genome,” 1998.
[26] A. S. Motahari, G. Bresler, and D. N. Tse, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273–6289, 2013.
[27] D. J. Cutler and J. D. Jensen, “To pool, or not to pool?,” Genetics, vol. 186, no. 1, pp. 41–43, 2010.

[28] C.-C. Cao and X. Sun, “Combinatorial pooled sequencing: experiment design and decoding,” Quantitative Biology, vol. 4, no. 1, pp. 36–46, 2016.
[29] A. Najafi, D. Nashta-ali, S. A. Motahari, M. Khani, B. H. Khalaj, and H. R. Rabiee, “Fundamental limits of pooled-DNA sequencing,” arXiv preprint arXiv:1604.04735, 2016.
[30] A. Shamir, “How to share a secret,” Communications of the ACM, vol. 22, no. 11, pp. 612–613, 1979.
[31] S. D. Gordon and J. Katz, “Rational secret sharing, revisited,” in International Conference on Security and Cryptography for Networks, pp. 229–241, Springer, 2006.
[32] J. Halpern and V. Teague, “Rational secret sharing and multiparty computation,” in Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 623–632, ACM, 2004.
[33] J. K. Arbogast, I. B. Sumner, and M. O. Lam, “Parallelizing shamir’s secret sharing algorithm,” Journal of Computing Sciences in Colleges, vol. 33, no. 3, pp. 12–18, 2018.
[34] H. Shen, J. Li, J. Zhang, C. Xu, Y. Jiang, Z. Wu, F. Zhao, L. Liao, J. Chen, Y. Lin, et al., “Comprehensive characterization of human genome variation by high coverage whole-genome sequencing of forty four caucasians,” PLoS One, vol. 8, no. 4, p. e59494, 2013.
[35] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short dna sequences to the human genome,” Genome biology, vol. 10, no. 3, p. R25, 2009.